Materials and methods relating to cancer diagnosis

ABSTRACT

The invention provides a number of genetic identifiers (genesets) which may be used as diagnostic tools to determine the presence or risk of breast cancer in a patient. The invention also provides genesets which may be used to classify a breast tumour cell as to its molecular subgroup. Each of the identified genesets may be used to produce customised specific nucleic acid microarrays for use in diagnosis and classification of breast tumour cells.

The present invention concerns materials and methods for diagnosing cancer, especially breast cancer. Particularly, but not exclusively, the invention relates to methods and kits for diagnosing the presence or risk of breast cancer using genetic identifiers.

Carcinoma of the breast is one of the leading causes of death and major illness amongst female populations worldwide. Despite rapid advances in understanding the molecular and genetic events that underlie breast carcinogenesis and the introduction of clinical screening programs, morbidity and mortality due to this disease unfortunately still remain at an unacceptably high level. Indeed, for many parts of the world, breast cancer remains one of the fastest growing cancers in local female populations (Chia et al., 2000). One major challenge in the diagnosis and treatment of breast cancer is its clinical and molecular heterogeneity. Individual breast cancers can exhibit tremendous variations in clinical presentation, disease aggressiveness, and treatment response (Tavassoli and Schitt, 1992), suggesting that this clinical entity may actually represent a conglomerate of many different and distinct cancer subtypes. In addition to variations in clinical behaviour, breast cancer can also display strikingly distinct patterns of incidence in different regional and ethnic populations. For example, in Caucasian populations, the majority of breast cancers occur in post-menopausal women at a mean and median age of 60 and 61 respectively (Giuliano, 1998). In contrast, studies in Asian populations show a bi-modal age of incidence pattern beginning at age 40 (Chia et al., 2000, see discussion). Thus, one outstanding question in tumour biology is to explain these regional and ethnic differences on the basis of genetic or environmental factors, and to ascertain if research findings obtained using Caucasian populations can be clinically translated to other ethnic populations as well.

Expression profiling using DNA microarrays has recently proved to be an extremely powerful and versatile approach towards the investigation of multiple aspects of tumour biology. Previous reports using microarrays on breast cancers have focused on the identification of novel tumour subtypes, or on the identification of genes that are differentially expressed between known cancer subgroups (Perou et al., 2000, Gruvberger et al., 2001, Hedenfalk et al., 2001). However, because these studies have focused primarily on samples obtained from Caucasian populations, it is an open question whether the findings described in these reports will also apply to breast cancers from other ethnic populations. There are also many other key issues that need to be addressed before the use of molecular profiling can become a clinical reality. For instance, there are at present almost no published reports where the expression signatures and molecular subtypes defined in one institution's study have been independently confirmed in a separate series from another centre. Such validations are obviously essential, however, as different health-care institutions are likely to differ in multiple ways which may affect the expression profile of the tumor being studied, such as in the surgical handling of tumor samples, choice of array technology platform, and patient population base. In addition, because it is usually unfeasible to sample the same tumor over an extended period of time, it is often unclear if the different subtypes defined using these approaches truly represent distinct biological entities, or if they represent a single tumor class in different stages of clinical evolution. As one example, there are currently conflicting opinions and data in the field on whether estrogen receptor negative (ER−) breast cancers represent biological entities that have directly arisen from an ER− progenitor cell type in the breast epithelia, or if they have ‘evolved’ from an originally ER+ state (Kuukasjarri et al., 1996; Parl 2000; Gruvberger et al, 2001).

To address these issues, the inventors have embarked upon a large-scale expression profiling project of breast tumours derived from Asian patients. First, using a combination of supervised and unsupervised clustering methods, they have been able to define a small set of genes which when used in combination serves as a ‘genetic identifier’ to distinguish if an unknown breast sample is either normal or malignant in a patient of ethnic Chinese descent. Such ‘genetic identifiers’ are of considerable use in the development of molecular diagnostic assays for specific patient populations. Second, using principal component analysis (PCA), the inventors show that the expression profiles of normal breast tissues are considerably less varied than tumour profiles. This finding supports current models of breast tumourigenesis, in which to a first approximation normal breast tissues can be thought of as a relatively constant ‘ground state’, and the widely varying expression profiles associated with individual tumours are probably indicative of their arising from this ‘ground state’ through many different and highly distinct tumourigenic pathways.

Third, by comparing the expression profiles of a series of invasive breast cancers from Chinese patients to published reports using patient samples of primarily Caucasian origin, they found that despite several inter-study methodological differences including choice of array technology platform, many of the key gene signatures and molecular subtypes were remarkably conserved between the two patient populations, suggesting that the molecular subtypes defined using expression-based genomics are indeed highly robust. To the inventors' knowledge, this is the first cross-institution validation study of this type reported for breast cancer.

Fourth, by studying the expression profiles of a series of ductal in-situ cancers (ductal carcinoma in situ, or DCIS), they also found that DCIS tumors express many of the ‘hallmark’ subtype-specific expression signatures associated with their invasive counterparts. Since DCIS cancers currently represent the earliest non-invasive malignant lesion detectable by conventional histopathology, these results suggest that the molecular subtypes defined in these studies probably arise at a relatively early stage of tumorigenesis (i.e. pre-invasive) and represent distinct biological entities, rather than a single cancer class in different stages of evolution.

Besides providing a molecular framework for the temporal progression of breast cancer, the inventors' results also support the feasibility of using expression-based genomic technologies for clinical cancer diagnosis and classification across different health-care institutions.

Thus, at its most general, the present invention provides a new diagnostic assay for determining the presence or risk of cancer, particularly breast cancer, in a patient using specific genetic identifiers. Further, the inventors have determined a series of multi-gene classifiers for breast cancer.

In the first instance, the inventors have determined a set of 20 genes (a “genetic identifier”) which may be used in combination to predict if an unknown breast tissue sample is either normal or malignant.

In addition to this first geneset (which can distinguish between tumor and normal breast samples), the inventors have also determined other genesets which can be used as genetic identifiers to classify tumour samples as to subtype. This is of great importance, not only from a research standpoint, but also to ensure the most appropriate treatment is provided.

Thus, the inventors have determined the following genesets which may be used to predict the presence of breast tumour and/or the class of tumour.

-   1) The geneset provided in Table 2, which when used as a combination, allows a user to predict if an unknown breast tissue sample is either normal or malignant, particularly using spotted cDNA microarrays.
-   2) A further set of genes (Table 4a and 4b) which when used in combination can also be used to distinguish between normal and tumour breast tissue samples. This geneset is more preferably used on expression profiles obtained using a commercially available technology platform such as genechips, e.g. Affymetrix U133A Genechips, but can also be utilized employing the spotted cDNA microarray technology described in 1).
-   3) A set of genes (Table 5a) which when used in combination can predict the Estrogen Receptor status of a confirmed breast tumour sample. A second set of genes (Table 5b) which when used in combination can predict the ERBB2 status of a confirmed breast tumour sample.
-   4) A set of genes (Table 6) which when used in combination can be used to predict the “molecular subtype” of a breast tumour sample according to the following 5 categories: Luminal, Basal, ERBB2, Normal-like, and ER-negative subtype II. In this embodiment of the present invention, the inventors have used two different types of classification algorithms, namely, (1) one-vs-all (OVA) support vector machines (SVM); and (2) genetic algorithm (GA)/maximum likelihood discriminant (MLHD) analysis. Different sets of genes are optimally used depending upon the type of classification algorithm used. Thus, distinct sets of genes are described below for each part.
-   5) A set of genes (Table 7) which when used in combination can be used to predict luminal subclass in Asian breast cancer patients. The inventors have determined that breast tumours of the “luminal” variety can be “split” into two distinct subtypes, Luminal A and Luminal D, which are clinically relevant. The genetic identifier (Table 7) is therefore preferably used after the tumour has been formally recognised as “luminal” in nature. This, of course, can be achieved using the multi-class predictor of Table 6. The Luminal D tumours are associated with certain expression signatures that are also found in highly aggressive non-Luminal tumours, e.g. ERBB2 and Basal. This supports the clinical importance of knowing the tumour subtype.

The determination of specific genesets (genetic identifiers) allows tissue samples to be classified (e.g. tumour v normal) according to the expression pattern of those genes in the tissue. For example, in the first genetic identifier (tumor vs normal) the inventors have determined 10 genes that are usually up-regulated in tumour cells relative to normal cells and 10 genes that are usually down-regulated in tumour cells relative to normal cells. By studying the expression pattern of these particular genetic identifiers, i.e. the composite levels of expression products of these genes in a test sample, it is possible to classify the sample as malignant or normal. Thus, the expression products are able to provide an expression profile or “fingerprint” that can serve to distinguish between normal and malignant cells.
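
By way of illustration only, the idea of classifying a sample from the composite expression levels of up-regulated and down-regulated genes might be sketched as follows. The gene labels, the scoring rule (mean of the up-regulated genes minus mean of the down-regulated genes) and the threshold are assumptions made for this sketch; they are not the genes of Table 2 nor the inventors' actual classification method.

```python
from statistics import mean

# Hypothetical placeholder labels; the real identifiers are those listed in Table 2.
UP_GENES = ["UP1", "UP2", "UP3"]
DOWN_GENES = ["DOWN1", "DOWN2", "DOWN3"]


def composite_score(log_ratios):
    """Mean log-ratio of the up-regulated genes minus that of the down-regulated genes."""
    up = mean(log_ratios[g] for g in UP_GENES)
    down = mean(log_ratios[g] for g in DOWN_GENES)
    return up - down


def classify(log_ratios, threshold=0.0):
    """Label a sample 'malignant' when its composite score exceeds the (assumed) threshold."""
    return "malignant" if composite_score(log_ratios) > threshold else "normal"


# Illustrative values only: log-ratios of a test sample relative to a normal reference.
sample = {"UP1": 1.2, "UP2": 0.8, "UP3": 1.0,
          "DOWN1": -0.9, "DOWN2": -1.1, "DOWN3": -1.0}
print(classify(sample))
```

In practice the threshold would be set from profiles of known tumour and normal samples rather than fixed at zero.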

In a first aspect of the present invention, there is provided a method of creating a nucleic acid expression profile for a breast tumour cell comprising the steps of

-   (a) isolating expression products from said breast tumour cell and a normal breast cell;
-   (b) identifying the expression profile of a plurality of genes selected from Table 2 for both the tumour and normal cell;
-   (c) comparing the expression profile of the tumour cell and the normal cell; and
-   (d) determining a nucleic acid expression profile characteristic of a breast tumour cell.

For the purposes of diagnosis, it is important to obtain an expression profile that is characteristic of a tumour cell, i.e. distinct from the expression profile of the equivalent normal cell. The method according to the first aspect determines the expression profile of a plurality of genes identified by the inventors to be a “genetic identifier” of breast tumour cells (see Table 2).

The expression profile of the individual genes that comprise the genetic identifier will differ slightly between independent samples. However, the inventors have realised that the expression profile of these particular genes that comprise the genetic identifier, when used in combination, provides a characteristic pattern of expression (expression profile) in a tumour cell that is recognisably different from that in a normal cell.

By creating a number of expression profiles of the genetic identifier from a number of known tumour or normal samples, it is possible to create a library of profiles for both normal and tumour samples. The greater the number of expression profiles, the easier it is to create a reliable characteristic expression profile standard (i.e. including statistical variation) that can be used as a control in a diagnostic assay. Thus, a standard profile may be one that is devised from a plurality of individual expression profiles and devised within statistical variation to represent either the tumour or normal cell profile.
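
As a minimal sketch of how a standard profile "including statistical variation" might be derived from a library of individual profiles, the following summarises each gene by its mean and sample standard deviation. The gene labels, the example values and the tolerance of k standard deviations are assumptions chosen for illustration, not parameters taken from the specification.

```python
from statistics import mean, stdev


def standard_profile(profiles):
    """Return {gene: (mean, sample standard deviation)} across a library of profiles."""
    standard = {}
    for gene in profiles[0]:
        values = [p[gene] for p in profiles]
        standard[gene] = (mean(values), stdev(values))
    return standard


def within_standard(profile, standard, k=2.0):
    """True if every gene lies within k standard deviations of the standard mean."""
    return all(abs(profile[g] - m) <= k * s for g, (m, s) in standard.items())


# Hypothetical library of tumour profiles over two placeholder genes.
tumour_library = [
    {"G1": 1.0, "G2": -1.0},
    {"G1": 1.2, "G2": -0.8},
    {"G1": 0.8, "G2": -1.2},
]
tumour_standard = standard_profile(tumour_library)
print(within_standard({"G1": 1.1, "G2": -0.9}, tumour_standard))
```

A larger library tightens the estimate of both the mean and the spread, which is consistent with the observation above that more profiles make the standard more reliable.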

Thus, the method according to the first aspect of the invention comprises the steps of

-   (a) isolating expression products from a breast tumour cell; contacting said expression products with a plurality of binding members capable of specifically and independently binding to expression products of a plurality of genes selected from Table 2, so as to create a first expression profile of a tumour cell;
-   (b) isolating expression products from a normal breast cell; contacting said expression products with the plurality of binding members used in step (a), so as to create a comparable second expression profile of a normal breast cell;
-   (c) comparing the first and second expression profiles to determine an expression profile characteristic of a breast tumour cell.

The expression products are preferably mRNA, or cDNA made from said mRNA. Alternatively, the expression product could be an expressed polypeptide. Identification of the expression profile is preferably carried out using binding members capable of specifically identifying the expression products of genes identified in Table 2. For example, if the expression products are cDNA then the binding members will be nucleic acid probes capable of specifically hybridising to the cDNA.

Preferably, either the expression product or the binding member will be labelled so that binding of the two components can be detected. The label is preferably chosen so as to be able to detect the relative levels/quantity and/or absolute levels/quantity of the expressed product so as to determine the expression profile based on the up-regulation or down-regulation of the individual genes that comprise the genetic identifiers. In other words, it is preferable that the binding members are capable of not only detecting the presence of an expression product but its relative abundance (i.e. the amount of product available).

The determination of the nucleic acid expression profile may be computerised and may be carried out within certain previously set parameters, to avoid false positives and false negatives.

The computer may then be able to provide an expression profile standard characteristic of a normal breast cell and a malignant breast cell as discussed above. The determined expression profiles may then be used to classify breast tissue samples as normal or malignant as a way of diagnosis.

Thus, in a second aspect of the invention, there is provided an expression profile database comprising a plurality of gene expression profiles of both normal and malignant breast cells where the genes are selected from Table 2, retrievably held on a data carrier. Preferably, the expression profiles making up the database are produced by the method according to the first aspect.

With the knowledge of the particular genetic identifiers, it is possible to devise many methods for determining the expression pattern or profile of the genes in a particular test sample of cells. For example, the expressed nucleic acid (RNA, mRNA) can be isolated from the cells using standard molecular biological techniques. The expressed nucleic acid sequences corresponding to the gene members of the genetic identifiers given in Table 2 can then be amplified using nucleic acid primers specific for the expressed sequences in a PCR. If the isolated expressed nucleic acid is mRNA, this can be converted into cDNA for the PCR reaction using standard methods.

The primers may conveniently introduce a label into the amplified nucleic acid so that it may be identified. Ideally, the label is able to indicate the relative quantity or proportion of nucleic acid sequences present after the amplification event, reflecting the relative quantity or proportion present in the original test sample. For example, if the label is fluorescent or radioactive, the intensity of the signal will indicate the relative quantity/proportion, or even the absolute quantity, of the expressed sequences. The relative quantities or proportions of the expression products of each of the genetic identifiers will establish a particular expression profile for the test sample. By comparing this profile with known profiles or standard expression profiles, it is possible to determine whether the test sample was from normal breast tissue or malignant breast tissue.

Alternatively, the expression pattern or profile can be determined using binding members capable of binding to the expression products of the genetic identifiers, e.g. mRNA, corresponding cDNA or expressed polypeptide. By labelling either the expression product or the binding member it is possible to identify the relative quantities or proportions of the expression products and determine the expression profile of the genetic identifiers. In this way the sample can be classified as normal or malignant by comparison of the expression profile with known profiles or standards. The binding members may be complementary nucleic acid sequences or specific antibodies. Microarray assays using such binding members are discussed in more detail below.

In a third aspect of the present invention, there is provided a method for determining the presence or risk of breast cancer in a patient comprising the steps of

-   (a) obtaining expression products from breast tissue cells obtained from a patient suspected of having or at risk of having breast cancer;
-   (b) contacting said expression products with one or more binding members capable of detecting the presence of an expression product corresponding to one or more genes identified in Table 2; and
-   (c) determining the presence or risk of breast cancer in said patient based on the binding profile of the expression products from the breast tissue cells to the one or more binding members.

The patient is preferably a woman of Asian descent, e.g. ethnic Chinese descent.

The step of determining the presence or risk of breast cancer may be carried out by a computer which is able to compare the binding profile of the expression products from the breast tissue cells under test with a database of other previously obtained profiles and/or a previously determined “standard” profile which is characteristic of the presence or risk of the tumour. The computer may be programmed to report the statistical similarity between the profile under test and the standard profiles so that a diagnosis may be made.
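
One way a computer might report the statistical similarity between a profile under test and stored standard profiles is sketched below using the Pearson correlation coefficient. The choice of Pearson correlation, the four-gene profiles and all numerical values are assumptions for illustration; the specification does not prescribe a particular similarity measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def report_similarity(test_profile, standards):
    """Return {label: correlation} of the test profile against each standard profile."""
    return {label: pearson(test_profile, profile) for label, profile in standards.items()}


# Hypothetical standard profiles (same gene order in every list).
standards = {
    "normal": [-1.0, -0.8, 1.0, 0.9],
    "malignant": [1.0, 0.9, -1.0, -0.8],
}
test_profile = [0.9, 1.1, -0.7, -1.0]

similarity = report_similarity(test_profile, standards)
best_match = max(similarity, key=similarity.get)
print(best_match, similarity)
```

The reported correlations could be shown alongside the best-matching label so that the final diagnosis remains with the clinician.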

As mentioned above, the present inventors have identified several key genes which have a different expression pattern in tumour cells as opposed to normal cells of the breast. Collectively, these genes comprise a ‘genetic identifier’. The inventors have shown (see below) that the combinatorial expression pattern of the genes belonging to the “genetic identifier” serves to distinguish between normal and tumour cells. Thus, by detecting the expression pattern of the genetic identifier in a breast tissue sample, it is possible to predict the state of the cell (normal or malignant) and whether that patient has or is at risk of developing breast cancer.

The genes that comprise the genetic identifier are given in Table 2. There are 20 genes shown, 10 of which are commonly highly expressed in tumour cells relative to normal cells and 10 of which commonly have decreased expression in tumour cells relative to normal cells. The differential expression of the genes was determined using tumour biopsies and normal tissue biopsies. By detecting the levels of expression products of these genes in a test sample, it is possible to classify the cells as normal or malignant based on the expression profile produced, i.e. an increase or decrease in their expression, relative to a standard pattern or profile seen in normal cells.

Thus, in a further aspect of the invention, there is provided a method of classifying a sample of breast tissue as normal or malignant, said method comprising the steps of

-   a) obtaining expression products from the cells of the breast tissue sample;
-   b) contacting said expression products with a plurality of binding members capable of specifically binding to the expression products of a plurality of genes selected from Table 2; and
-   c) classifying the sample as normal or malignant based on the binding profile of the expression products from the sample and the binding members.

The sample of breast tissue is preferably from a woman of Asian descent, e.g. ethnic Chinese descent.

As before, the expression product may be a transcribed nucleic acid sequence or the expressed polypeptide. The transcribed nucleic acid sequence may be RNA or mRNA. The expression product may also be cDNA produced from said mRNA.

The binding member may be a complementary nucleic acid sequence which is capable of specifically binding to the transcribed nucleic acid under suitable hybridisation conditions. Typically, cDNA or oligonucleotide sequences are used.

Where the expression product is the expressed protein, the binding member is preferably an antibody, or molecule comprising an antibody binding domain, specific for said expressed polypeptide.

The binding member may be labelled for detection purposes using standard procedures known in the art. Alternatively, the expression products may be labelled following isolation from the sample under test. A preferred means of detection is using a fluorescent label which can be detected by a light meter. Alternative means of detection include electrical signalling. For example, the Motorola e-sensor system has two probes, a “capture probe” which is freely floating, and a “signalling probe” which is attached to a solid surface which doubles as an electrode surface. Both probes function as binding members to the expression product. When binding occurs, both probes are brought into close proximity with each other resulting in the creation of an electrical signal which can be detected.

As discussed above, the binding members may be oligonucleotide primers for use in a PCR (e.g. multi-plexed PCR) to specifically amplify the number of expressed products of the genetic identifiers. The products would then be analysed on a gel. However, preferably, the binding member is a single nucleic acid probe or antibody fixed to a solid support. The expression products may then be passed over the solid support, thereby bringing them into contact with the binding member. The solid support may be a glass surface, e.g. a microscope slide; beads (Lynx); or fibre-optics. In the case of beads, each binding member may be fixed to an individual bead and they are then contacted with the expression products in solution.

Various methods exist in the art for determining expression profiles for particular gene sets and these can be applied to the present invention. For example, bead-based approaches (Lynx) or molecular bar-codes (Surromed) are known techniques. In these cases, each binding member is attached to a bead or “bar-code” that is individually readable and free-floating to ease contact with the expression products. The binding of the binding members to the expression products (targets) is achieved in solution, after which the tagged beads or bar-codes are passed through a device (e.g. a flow-cytometer) and read.

A further known method of determining expression profiles is instrumentation developed by Illumina, namely, fibre-optics. In this case, each binding member is attached to a specific “address” at the end of a fibre-optic cable. Binding of the expression product to the binding member may induce a fluorescent change which is readable by a device at the other end of the fibre-optic cable.

The present inventors have successfully used a nucleic acid microarray comprising a plurality of nucleic acid sequences fixed to a solid support. By passing nucleic acid sequences representing expressed genes, e.g. cDNA, over the microarray, they were able to create a binding profile characteristic of the expression products from tumour cells and normal cells derived from breast tissue.

The present invention further provides a nucleic acid microarray for classifying a breast tissue sample as malignant or normal comprising a solid support housing a plurality of nucleic acid sequences, said nucleic acid sequences being capable of specifically binding to expression products of one or more genes identified in Table 2. The classification of the sample will lead to the diagnosis of breast cancer in a patient. Preferably the solid support will house nucleic acid sequences being capable of specifically and independently binding to expression products of at least 5 genes, more preferably, at least 10 genes or at least 15 genes identified in Table 2. In a most preferred embodiment, the solid support will house nucleic acid sequences being capable of specifically and independently binding to expression products of all 20 genes identified in Table 2.

Typically, high density nucleic acid sequences, usually cDNA or oligonucleotides, are fixed onto very small, discrete areas or spots of a solid support. The solid support is often a microscope glass slide or a membrane filter, coated with a substrate (or chips). The nucleic acid sequences are delivered (or printed), usually by a robotic system, onto the coated solid support and then immobilized or fixed to the support.

In a preferred embodiment, the expression products derived from the sample are labelled, typically using a fluorescent label, and then contacted with the immobilized nucleic acid sequences. Following hybridization, the fluorescent markers are detected using a detector, such as a high resolution laser scanner. In an alternative method, the expression products could be tagged with a non-fluorescent label, e.g. biotin. After hybridisation, the microarray could then be ‘stained’ with a fluorescent dye that binds/bonds to the first non-fluorescent label (e.g. fluorescently labelled streptavidin, which binds to biotin).

A binding profile indicating a pattern of gene expression (expression pattern or profile) is obtained by analysing the signal emitted from each discrete spot with digital imaging software. The pattern of gene expression of the experimental sample can then be compared with that of a control (i.e. an expression profile from a normal tissue sample) for differential analysis.

As mentioned above, the control or standard may be one or more expression profiles previously judged to be characteristic of normal or malignant cells. These one or more expression profiles may be retrievably stored on a data carrier as part of a database. This is discussed above. However, it is also possible to introduce a control into the assay procedure. In other words, the test sample may be “spiked” with one or more “synthetic tumour” or “synthetic normal” expression products which can act as controls to be compared with the expression levels of the genetic identifiers in the test sample.

Most microarrays utilize either one or two fluorophores. For two-colour arrays, the most commonly used fluorophores are Cy3 (green channel excitation) and Cy5 (red channel excitation). The object of the microarray image analysis is to extract hybridization signals from each expression product. For one-colour arrays, signals are measured as absolute intensities for a given target (essentially for arrays hybridized to a single sample). For two-colour arrays, signals are measured as ratios of two expression products (e.g. sample and control (controls are otherwise known as a ‘reference’)) with different fluorescent labels.
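
As a simple illustration of the ratio measurement used for two-colour arrays, the sketch below converts the two channel intensities for each spot into a log2 ratio. The assignment of the sample to Cy5 and the reference to Cy3, the optional background subtraction, and the spot names and intensities are all assumptions for this example; actual labelling schemes and image-analysis software vary.

```python
from math import log2


def log_ratio(cy5, cy3, background=0.0):
    """log2 of background-subtracted sample (Cy5) over reference (Cy3) intensity."""
    return log2((cy5 - background) / (cy3 - background))


# Hypothetical spot intensities: (Cy5 sample signal, Cy3 reference signal).
spots = {
    "GENE_A": (800.0, 200.0),   # four-fold higher in the sample
    "GENE_B": (100.0, 400.0),   # four-fold lower in the sample
    "GENE_C": (500.0, 500.0),   # unchanged
}

profile = {gene: log_ratio(cy5, cy3) for gene, (cy5, cy3) in spots.items()}
print(profile)
```

Working in log2 units makes up-regulation and down-regulation symmetric about zero, which is convenient when profiles are later compared with a standard.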

The microarray in accordance with the present invention preferably comprises a plurality of discrete spots, each spot containing one or more oligonucleotides and each spot representing a different binding member for an expression product of a gene selected from Table 2. In a preferred embodiment, the microarray will contain 20 spots for each of the 20 genes provided in Table 2. Each spot will comprise a plurality of identical oligonucleotides each capable of binding to an expression product, e.g. mRNA or cDNA, of the gene of Table 2 it is representing.

In a still further aspect of the present invention, there is provided a kit for classifying a breast tissue sample as normal or malignant, said kit comprising one or more binding members capable of specifically binding to an expression product of one or more genes identified in Table 2, and a detection means.

Preferably, the one or more binding members (antibody binding domains or nucleic acid sequences e.g. oligonucleotides) in the kit are fixed to one or more solid supports e.g. a single support for microarray or fibre-optic assays, or multiple supports such as beads. The detection means is preferably a label (radioactive or dye, e.g. fluorescent) for labelling the expression products of the sample under test. The kit may also comprise means for detecting and analysing the binding profile of the expression products under test.

Alternatively, the binding members may be nucleotide primers capable of binding to the expression products of the genes identified in Table 2 such that they can be amplified in a PCR. The primers may further comprise detection means, i.e. labels that can be used to identify the amplified sequences and their abundance relative to other amplified sequences.

The kit may also comprise one or more standard expression profiles retrievably held on a data carrier for comparison with expression profiles of a test sample. The one or more standard expression profiles may be produced according to the first aspect of the present invention.

The present invention further provides a method of diagnosing the presence or risk of breast cancer in a patient of Asian descent, said method comprising

-   obtaining a breast tissue sample;
-   isolating expression products from said sample;
-   labelling said expression products;
-   contacting said labelled expression products with a plurality of binding members representing a plurality of genes selected from Table 2;
-   determining the presence or risk of breast cancer in said patient, based on the binding profile of said labelled expression products and the binding members.

The breast tissue sample may be obtained as excisional breast biopsies or fine-needle aspirates.

Again, the expression products are preferably mRNA or cDNA produced from said mRNA. The binding members are preferably oligonucleotides fixed to one or more solid supports in the form of a microarray or beads (see above). The binding profile is preferably analysed by a detector capable of detecting the label used to label the expression products. The determination of the presence or risk of breast cancer can be made by comparing the binding profile of the sample with that of a control e.g. standard expression profiles.

In all of the aspects described above, it is preferred to use binding members capable of specifically binding (and, in the case of nucleic acid primers, amplifying) expression products of all 20 genetic identifiers. This is because the expression levels of all 20 genes make up the expression profile specific for the cells under test. The classification of the expression profile is more reliable the greater the number of gene expression levels tested. Thus, preferably expression levels of more than 5 genes selected from Table 2 are assessed, more preferably, more than 10, even more preferably, more than 15 and most preferably all 20 genes.

The genetic identifier (Table 2) mentioned above is particularly suitable for spotted cDNA microarray technology where the microarray (or other similar technology) has been created specifically for this purpose. However, the present inventors have appreciated that the present invention may be modified so that commercially available genechips may be used, rather than going to the trouble of creating one specifically containing the genes identified in Table 2. With this in mind, the inventors have identified a further genetic identifier (Table 4a or 4b) which, although it may be utilized using the microarray technology described above, may also be used on commercially available genechips, e.g. Affymetrix U133A Genechips.

Thus, the aspects of the invention described above may also be carried out using the geneset of Table 4a or 4b instead of that of Table 2, and these may be used either on commercially available genechips such as Affymetrix U133A Genechips, or using the microarray technology described above.

The present inventors have also identified a further set of genes (Table 5a) which may be used to classify a breast tumour on the basis of the Estrogen Receptor (ER) status. This is clinically important as ER⁺ tumours can be treated with hormonal therapies (e.g. tamoxifen) and ER⁻ tumours are typically more aggressive and refractory to treatment.

Likewise, the present inventors have also identified a further set of genes (Table 5b) which may be used to classify a breast tumour on the basis of its ERBB2 status. Knowing the ERBB2 status of a breast tumour is also clinically important as ERBB2⁺ tumours are typically highly aggressive and carry a poor clinical prognosis. ERBB2⁺ tumours are also candidates for treatment with Herceptin (an anti-cancer drug).

The genesets provided in Tables 5a and 5b were determined by generating expression profiles for a set of breast tumour samples using Affymetrix U133A Genechips. A series of statistical algorithms were used to identify a set of genes that were differentially expressed in ER⁺ vs ER⁻ samples as well as ERBB2⁺ vs ERBB2⁻ samples. Accordingly, the present invention further provides genesets which may be used in methods of classifying breast tumours according to ER and ERBB2 status.

Thus, in a further aspect of the present invention, there is provided a method of classifying a breast tumour according to its ER and/or ERBB2 status comprising

-   a) obtaining expression products from the tumour cells;
-   b) contacting said expression products with a plurality of binding members capable of specifically binding to the expression products of a plurality of genes selected from Table 5; and
-   c) classifying the tumour cell on the basis of ER and/or ERBB2 status based on the binding profile of the expression products from the sample and the binding members.

As with the first aspect of the present invention, the plurality of binding members are preferably nucleic acid sequences and more preferably nucleic acid sequences fixed to a solid support, for example as a nucleic acid microarray. The nucleic acid sequences may be oligonucleotide probes or cDNA sequences.

The tumour cell may be classified according to its ER and/or ERBB2 status on the basis of the expression of the genes identified in Table 5. Table 5 identifies each gene as either being upregulated (+) or downregulated (−) in an ER⁺ or ERBB2⁺ tumour. With this information, it is possible to determine whether the breast tumour cell under test is ER⁻ or ER⁺ and/or ERBB2⁺ or ERBB2⁻.
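By way of illustration only, one simple way of using the (+)/(−) annotations is sketched below. The scoring rule (averaging centred expression values after reversing the sign of genes marked (−)), the zero threshold, and the names `signed_status_score`, `call_status`, `gene_A`, `gene_B` and `gene_C` are assumptions made for this sketch; they are not taken from Table 5 and are not presented as the classification method actually used.

```python
from statistics import mean


def signed_status_score(expression, directions):
    """Average centred expression values, reversing the sign of genes
    annotated '-' so that a positive score favours the '+' class.

    expression: {gene: centred log-expression value} for one sample
    directions: {gene: '+' or '-'} as annotated for the geneset
    """
    signed = []
    for gene, direction in directions.items():
        if gene not in expression:
            continue  # genes missing from the sample are simply skipped
        value = expression[gene]
        signed.append(value if direction == "+" else -value)
    if not signed:
        raise ValueError("no genes from the geneset were measured")
    return mean(signed)


def call_status(expression, directions, positive="ER+", negative="ER-"):
    """Hypothetical decision rule: a score above zero gives the positive call."""
    score = signed_status_score(expression, directions)
    return positive if score > 0 else negative


# Placeholder annotations and sample values (not real data).
directions = {"gene_A": "+", "gene_B": "+", "gene_C": "-"}
sample = {"gene_A": 1.2, "gene_B": 0.8, "gene_C": -1.0}
print(call_status(sample, directions))
```

The same functions could be applied to ERBB2 status by passing the corresponding annotations and the labels `"ERBB2+"` and `"ERBB2-"`.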

As with all aspects of the present invention, the plurality of genes selected from the determined genesets (Tables 2-7 with the exception of Table 6b) may vary in actual number. It is preferable to use at least 5 genes, more preferably at least 10 genes in order to carry out the invention. Of course, the known microarray and genechip technologies allow large numbers of binding members to be utilized. Therefore, the more preferred method would be to use binding members representing all of the genes in each geneset. However, the skilled person will appreciate that a proportion of these genes may be omitted and the method still carried out in a reliable and statistically accurate fashion. In most cases, it would be preferable to use binding members representing at least 70%, 80% or 90% of the genes in each respective geneset.

In a further aspect of the invention, there is provided a method of classifying a breast tumour cell as to its molecular subtype comprising

-   a) obtaining expression products from the tumour cells;
-   b) contacting said expression products with a plurality of binding members capable of specifically binding to the expression products of a plurality of genes selected from Table 6; and
-   c) classifying the tumour cell with regard to its molecular subtype based on the binding profile of the expression products from the tumour cell and the binding members.

The molecular subtypes are preferably Luminal, ERBB2, Basal, ER-type II and Normal/normal like. These sub-types are defined in the following text.

In practice, the expression profile of the tumour sample to be classified is first determined using the genesets described in Table 6 (whether Table 6a or 6b is used depends on the type of classification algorithm). Second, the expression profile is compared to a database of “reference” (control) profiles, where each “reference” profile corresponds to the “average” tumour belonging to that particular molecular type. In this case, rather than just having normal and tumour, or ER⁺ and ER⁻, the “reference” profiles will correspond to five distinct subtypes. Third, by using a suitable classification algorithm, the unknown tumour sample can be assigned to the specific subtype for which the expression profile finds a good reference match.

Where the plurality of binding members are selected as being capable of binding to the expression products of a plurality of genes from Table 6a, the number of binding members used will govern the reliability of the test. In other words, it is not necessary to use binding members capable of specifically and independently binding to all genes identified in Table 6a, but the more binding members used, the better the test. Therefore, by plurality it is meant preferably at least 50%, more preferably at least 70% and even more preferably at least 90% of the genes as mentioned above.

In a still further aspect of the invention, there is provided a method of further sub-classifying a breast tumour cell as either luminal A or luminal D subtype comprising

-   a) obtaining expression products from the tumour cells;
-   b) contacting said expression products with a plurality of binding members capable of specifically binding to the expression products of a plurality of genes selected from Table 7; and
-   c) classifying the tumour cell with regard to its molecular subtype based on the binding profile of the expression products from the tumour cell and the binding members.

Preferably, the method is carried out on expression products obtained from a breast tumour cell which has already been classified as “luminal”, e.g. using the genetic identifier of Table 6a or 6b.

With regard to the geneset provided in Table 6b, it is preferable that all of the genes in the geneset are used for classification. A reduction in the number of genes will reduce the likelihood of a reliable result. This is because this geneset is selected using the genetic algorithm approach.

The inventors have provided a number of genetic identifiers (Tables 2 to 7) which can be used to diagnose and/or predict risk of breast cancer and, further, can be used to classify the type of breast cancer, particularly for women of Asian descent.

The provision of these genetic identifiers allows diagnostic tools, e.g. nucleic acid microarrays, to be custom made and used to predict, diagnose or subtype tumours. Further, such diagnostic tools may be used in conjunction with a computer which is programmed to determine the expression profile obtained using the diagnostic tool (e.g. microarray) and compare it to a “standard” expression profile characteristic of normal versus tumour and/or molecular subtypes depending on the particular genetic identifier used. In doing so, the computer not only provides the user with information which may be used to diagnose the presence or type of a tumour in a patient, but at the same time, the computer obtains a further expression profile by which to determine the “standard” expression profile and so can update its own database.

Thus, the invention allows, for the first time, specialized chips (microarrays) to be made containing probes corresponding to the genesets identified in Tables 2 to 7. The exact physical structure of the array may vary and range from oligonucleotide probes attached to a 2-dimensional solid substrate to free-floating probes which have been individually “tagged” with a unique label, e.g. “bar code”.

A database corresponding to the various biological classifications (e.g. normal, tumour, molecular subtype etc.) may be created which will consist of the expression profiles of various breast tissues as determined by the specialized microarrays. The database may then be processed and analysed such that it will eventually contain (i) the numerical data corresponding to each expression profile in the database, (ii) a “standard” profile which functions as the canonical profile for that particular classification; and (iii) data representing the observed statistical variation of the individual profiles to the “standard” profile.

In practice, to evaluate a patient's sample, the expression products of that patient's breast cells (obtained via excisional biopsy or fine needle aspirate) will first be isolated, and the expression profile of that cell determined using the specialized microarray. To classify the patient's sample, the expression profile of the patient's sample will be queried against the database described above. Querying can be done in a direct or indirect manner. The “direct” manner is where the patient's expression profile is directly compared to other individual expression profiles in the database to determine which profile (and hence which classification) delivers the best match. Alternatively, the querying may be done more “indirectly”; for example, the patient expression profile could be compared against simply the “standard” profile in the database. The advantage of the indirect approach is that the “standard” profiles, because they represent the aggregate of many individual profiles, will be much less data intensive and may be stored on a relatively inexpensive computer system which may then form part of the kit (i.e. in association with the microarrays) in accordance with the present invention. In the direct approach, it is likely that the data carrier will be of a much larger scale (e.g. a computer server) as many individual profiles will have to be stored.

By comparing the patient expression profile to the standard profile (indirect approach) and the pre-determined statistical variation in the population, it will also be possible to deliver a “confidence value” as to how closely the patient expression profile matches the “standard” canonical profile. This value will provide the clinician with valuable information on the trustworthiness of the classification, and, for example, whether or not the analysis should be repeated.
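A minimal sketch of the indirect approach is given below for illustration. It assumes that each “standard” profile is stored together with the mean and standard deviation of the correlations observed between the individual database profiles and that standard; the use of Pearson correlation as the match score, the z-score form of the “confidence value”, and the names `pearson` and `classify_indirect` are assumptions of this sketch and are not prescribed by the text.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify_indirect(profile, standards):
    """Compare a patient profile against each stored "standard" profile.

    standards: {label: (standard_profile, mean_r, sd_r)} where mean_r and
    sd_r summarise how well the individual database profiles of that class
    correlate with their own standard (hypothetical storage layout).
    Returns (best_label, correlation, z_score).
    """
    scores = {label: pearson(profile, std)
              for label, (std, _, _) in standards.items()}
    best = max(scores, key=scores.get)
    _, mean_r, sd_r = standards[best]
    # z-score: how typical this match is relative to known class members
    z_score = (scores[best] - mean_r) / sd_r
    return best, scores[best], z_score


# Placeholder standards and patient profile (not real data).
standards = {
    "normal": ([1.0, 0.8, -0.9, -1.1], 0.95, 0.03),
    "tumour": ([-1.0, -0.7, 1.2, 0.9], 0.90, 0.05),
}
label, r, z = classify_indirect([-0.8, -0.9, 1.0, 1.1], standards)
print(label, round(r, 3), round(z, 2))
```

Under this sketch, a strongly negative z-score would indicate that the patient profile matches its assigned class less well than typical members of that class do, which is one situation in which repeating the analysis might be considered.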

As mentioned above, it is also possible to store the patient expression profiles on the database, and these may be used at any time to update the database.

Aspects and embodiments of the present invention will now be illustrated, by way of example, with reference to the accompanying figures. Further aspects and embodiments will be apparent to those skilled in the art. All documents mentioned in this text are incorporated herein by reference.

FIG. 1: Unsupervised Partitioning of Normal and Tumour Breast Samples. Individual expression profiles were subjected to standard data selection filters (see text), and the resultant data matrix, comprising approximately 800 array targets, was sorted using hierarchical clustering. Normal samples (‘xxxN’) are underlined, while tumour samples (‘xxxT’) are not. Numbers represent the NCC Tissue Repository numbers associated with each sample. The dendrogram branches illustrate the extent of similarity between the biological samples. Normal and Tumour samples segregate independently, but only at secondary levels of the dendrogram. Minor variations on the data filters used to select this data set also yielded highly similar dendrograms (P. Tan, unpublished observations).

FIG. 2: Improvement of Normal and Tumour Sample Partitioning Using Combined Outlier Genesets (COG). (A) Independent outlier genesets for normal (left) and tumour (right) samples were defined. Each clustergram consists of a matrix of array targets (rows) by biological samples (columns), and light grey represents upregulation, while dark grey represents downregulation (see Materials and Methods for selection criteria). The outlier geneset for normal samples consists of 60 genes, while the outlier geneset for tumour samples consists of 75 genes. Specific normal and tumour samples used in the establishment of the outlier genesets are listed below each clustergram. Underlined sample numbers indicate reciprocal hybridizations, where the tumour/normal sample was labelled using Cy5 and the reference sample Cy3. (B) Partitioning of normal and tumour samples using the COG. The 108 unique array targets comprising the COG were used to segregate the tumour and normal samples from FIG. 1 using standard hierarchical clustering. In contrast to FIG. 1, division of the normal (xxxN) and tumour (xxxT) samples is now observed as a primary class division, with 2 misclassifications.

FIG. 3: Partitioning of Normal and Tumour Samples using a Minimal 20-Element Genetic Identifier. The 20 array targets from the COG (Table 2) that were most highly correlated to the tumour/normal class distinction were used to segregate (A) the training set from FIGS. 1 and 2 b, and (B) a naïve test set of 10 normals and 11 tumours. In both cases, accurate segregation of normal and tumour samples at the level of the primary class division can be observed.

FIG. 4: Comparison of expression profile variation in normal and tumour samples. Independent normal and tumour datasets were established using the combined samples of FIGS. 3 a and 3 b (total=48 samples). Using PCA, the entire gene expression matrix of approximately 8000 array targets in these datasets was reduced to basic principal components. The extent of variance of each component normalized to the 1st component (normalized eigenvalue) is depicted on the y-axis, and the principal component number on the x-axis, beginning with the 2nd component (since the first component of each set is 1). To observe the rate of ‘decay’ of information, the components for each dataset are depicted in decreasing order of variance. Normal samples consistently exhibit a lower information decay rate across their components compared with tumours.

FIG. 5: Gene expression patterns of 62 samples including 56 carcinomas and 6 normal tissues, analyzed by hierarchical clustering using different gene sets. Samples were divided into 6 subtypes based on differences in gene expression (legend), and are: Luminal (S1), ERBB2+/ER+ (S2), ERBB2+/ER− (S3), Basal-like (S4), ER negative subtype II (S5), and Normal/Normal-like (S6).

(a) Unsupervised hierarchical clustering using a dataset of 1796 genes. The gray underline indicates a cluster which contains a mixture of Luminal and ERBB2+/ER+ samples. (b) Semi-supervised hierarchical clustering using the ‘common intrinsic gene set’ (CIS, 292 genes). (c) The full cluster diagram using the CIS. Shaded bars to the right of the clustergram represent gene clusters A-E (Table 3), and are (A) Luminal epithelial genes with ER. (B) ‘Novel’ genes. (C) Basal epithelial genes. (D) Normal breast-like genes. (E) ERBB2-related genes.

FIG. 6: (a)-(d) Representative Examples of DCIS Samples Used in this Study. Two samples are shown, (a)/(b) and (c)/(d). The DCIS status of each sample was confirmed both by examination of paraffin H & E sections of samples ((a) and (c), HE), as well as frozen cryosections ((b) and (d), FS) of the actual sample that was processed for expression profiling. (e) ‘Distinct Origins’ and ‘Evolutionary’ Theories of Breast Cancer Development. The ‘Distinct Origins’ hypothesis proposes that different molecular subtypes of cancer arise via different tumorigenic pathways, and thus constitute distinct biological entities. The ‘Evolutionary’ hypothesis proposes that the different molecular subtypes arise as a result of a single (or a few) cancer classes undergoing different stages of phenotypic development. One cannot distinguish between the two hypotheses by only studying advanced invasive cancers obtained at a single point in time.

FIG. 7: DCIS samples express the hallmark genes of advanced carcinoma subtypes. DCIS samples are shown as dark vertical lines. Based upon the CIS geneset, six out of twelve DCIS samples cluster within the ERBB2+ groups (S2 and S3), five samples cluster in the Luminal group, and one sample in the normal-like group. Shaded bars to the right of the clustergram represent the same gene clusters as shown in FIG. 5. (A) Luminal epithelial genes with ER. (B) Basal epithelial genes. (C) Normal breast-like genes. (D) ERBB2.

FIG. 8: Summary of pathway-specific and overlapping genes for the Luminal A and ERBB2+ tumor subtypes. ‘U’ indicates upregulated genes and ‘D’ indicates downregulated genes.

For example, there are 245 genes upregulated and 705 genes downregulated during the normal/DCIS (Luminal) transition. Numbers in bold are overlapping genes between two gene sets. a) Results based upon a false-discovery rate (FDR) of 5%. b) Results when only the top 100 most significantly regulated unique genes are compared.

FIG. 9: a) Discovery of a Luminal D subtype. A series of previously homogenous Luminal A tumors (identified as subtype S1 by the CIS in FIGS. 5 and 7) were regrouped by hierarchical clustering based upon ‘proliferation cluster’ linked genes. Two broad groups are observed, which exhibit low (Luminal A) and high (Luminal D) levels of expression of the ‘proliferation cluster’ respectively. b) High levels of the 36-gene ‘proliferation cluster’ are also observed in other aggressive tumor types. Luminal D (15 out of 17 samples, indicated as dark bars under sample numbers), Basal (ER−) and ERBB2+ve samples all strongly express the 36-gene ‘proliferation cluster’ (bar below clustergram, left branch), while Luminal A (all but one boundary case), normal-like and normal samples show low levels of expression. Light grey/white indicates upregulation, while dark grey/black indicates downregulation.

MATERIALS AND METHODS

Breast Tissue Samples

Primary breast tissues were obtained from the NCC Tissue Repository, after appropriate approvals had been obtained from the institution's Repository and Ethics Committees. In general, all tumour and matched normal tissues were simultaneously harvested during surgical excision of the tumour. After surgical excision, the samples were immediately grossly dissected in the operating theatre, and flash-frozen in liquid N2. Histological confirmation of tumour status was subsequently provided by the Dept of Pathology at Singapore General Hospital. Samples were stored in liquid N2 until processing was performed. With the exception of 1 tumour and matched normal sample pair that came from an Indian patient, all other samples were derived from Chinese patients. Confirmation of the DCIS status of tissue samples used in this report was achieved both by conventional H & E staining of archival samples, as well as direct cryosections of the actual sample that was processed for expression profiling.

Sample Preparation and Microarray Hybridization

For hybridisations involving Affymetrix Genechips, RNA was extracted from tissues using Trizol reagent, purified through a Qiagen Spin Column, and processed for Affymetrix Genechip hybridization according to the manufacturer's instructions. For each spotted cDNA microarray hybridization, 2-3 μg of total RNA was used following single-round linear amplification (Wang et al., 2000). All breast samples for the spotted cDNA microarray hybridisations were compared against a standard commercially available mRNA reference pool (Stratagene) that had been similarly amplified. cDNA microarrays were fabricated following standard procedures (DeRisi et al., 1997), using cDNA clones obtained from various commercial vendors (Incyte, Research Genetics). Except where mentioned, samples were fluorescently labelled using Cy3 dye, while the reference was labelled with Cy5. Hybridizations were performed using Affymetrix U133A Genechips. After hybridization, microarray images were captured using a CCD-based microarray scanner (Applied Precision, Inc).

Data Processing and Analysis

For spotted cDNA microarray data, fluorescence intensities corresponding to individual microarrays were uploaded into a centralized Oracle 8i database. Establishment of various data sets and gene retrievals were performed using standard SQL queries. Hierarchical clustering was performed using the program Xcluster (Stanford) and visualized using the program Treeview (Eisen et al., 1998). To identify outlier genes in tumour and normal datasets, array elements were chosen which consistently exhibited greater than 3-fold regulation across 90% of all arrays for the normal dataset and 80% of all arrays for the tumour dataset. Correlation analysis was performed using the similarity metric concept employed in Golub et al. (1999). Briefly, the similarity metrics corresponding to the normal/tumour class distinction were calculated for each gene, and the genes then sorted based on descending order of their similarity values. After being sorted by their positive and negative correlation to the class distinction, the top 10 genes from each class were chosen for subsequent cluster analysis. Principal Component Analysis (PCA) was performed by linearly transforming the gene expression matrix, which consists of a number of correlated variables, into a ‘smaller’ number of uncorrelated variables (principal components). For datasets in linear subspace, the data can be ‘compressed’ in this manner without losing too much information while simplifying the data representation. The first principal component accounts for maximum variability in the data, and each succeeding component accounts for parts of the remaining variability.
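The correlation-based ranking described above may be sketched as follows. The sketch assumes that the similarity metric is the signal-to-noise statistic commonly associated with Golub et al. (1999), i.e. the difference of class means divided by the sum of class standard deviations; the text does not state the exact formula used, and the function names `signal_to_noise` and `select_identifier` are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(tumour_vals, normal_vals):
    """(mean_tumour - mean_normal) / (sd_tumour + sd_normal).

    Assumes each class has at least two samples and non-zero spread.
    """
    return (mean(tumour_vals) - mean(normal_vals)) / (
        stdev(tumour_vals) + stdev(normal_vals))


def select_identifier(expr, is_tumour, k=10):
    """Rank genes by correlation to the tumour/normal class distinction.

    expr: {gene: [log-ratio per array]}
    is_tumour: list of booleans, one per array, in the same order
    Returns (top k positively correlated, top k negatively correlated).
    """
    scores = {}
    for gene, vals in expr.items():
        tumour = [v for v, flag in zip(vals, is_tumour) if flag]
        normal = [v for v, flag in zip(vals, is_tumour) if not flag]
        scores[gene] = signal_to_noise(tumour, normal)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:][::-1]


# Placeholder data: two tumour arrays followed by two normal arrays.
expr = {
    "g_up": [2.0, 2.2, 0.1, -0.1],
    "g_down": [-1.9, -2.1, 0.2, 0.0],
    "g_flat": [0.1, -0.1, 0.0, 0.2],
}
up, down = select_identifier(expr, [True, True, False, False], k=1)
print(up, down)
```

With `k=10`, the two returned lists together would give a 20-element set of the kind described in the text.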

For Affymetrix Genechips, raw Genechip scans were quality controlled using a commercially available software program (Genedata Refiner) and deposited into a central data storage facility. The expression data was filtered by removing genes whose expression was absent in all samples (i.e. ‘A’ calls), subjected to a log2 transformation, and normalized by median centering all remaining genes and samples. Data analysis was then performed either using the Genedata Expressionist software analysis package or using conventional spreadsheet applications. The unsupervised dataset of 1796 genes used in FIG. 5 a was established by selecting genes exhibiting a standard deviation (SD) of >1 across all well-measured samples. Average-linkage hierarchical clustering was applied by using the CLUSTER program and the results were displayed by using TREEVIEW (9). Significance analysis of microarrays (SAM) was performed essentially as described in Tusher et al., (2001) (10), using a fold-change cutoff of 2 and an appropriate delta value to cap the gene false-discovery rate (FDR) at 5% (0.05).
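The filtering and normalization steps above can be illustrated with the following minimal sketch. The order in which genes and samples are median centred, the input layout, and the names `preprocess` and `variable_genes` are assumptions of this sketch; the commercial packages named in the text are not reproduced here.

```python
import math
from statistics import median, stdev


def preprocess(signals, calls):
    """Filter, log2-transform and median-centre Genechip-style data.

    signals: {gene: [positive raw intensity per sample]}
    calls:   {gene: [detection call per sample, 'A' meaning absent]}
    """
    # 1. Remove genes called absent ('A') in every sample, then log2.
    kept = {g: [math.log2(v) for v in vals]
            for g, vals in signals.items()
            if any(c != "A" for c in calls[g])}
    # 2. Median-centre each gene (row).
    kept = {g: [v - median(vals) for v in vals] for g, vals in kept.items()}
    # 3. Median-centre each sample (column).
    n_samples = len(next(iter(kept.values())))
    col_medians = [median(vals[j] for vals in kept.values())
                   for j in range(n_samples)]
    return {g: [v - m for v, m in zip(vals, col_medians)]
            for g, vals in kept.items()}


def variable_genes(data, sd_cutoff=1.0):
    """Select genes whose standard deviation across samples exceeds the cutoff."""
    return [g for g, vals in data.items() if stdev(vals) > sd_cutoff]


# Placeholder intensities and calls (not real data).
signals = {"a": [100, 400, 1600, 6400],
           "b": [200, 210, 190, 205],
           "c": [50, 60, 55, 52]}
calls = {"a": ["P"] * 4, "b": ["P"] * 4, "c": ["A"] * 4}
data = preprocess(signals, calls)
print(sorted(data))
print(variable_genes({"x": [-2.0, 0.0, 2.0], "y": [0.1, 0.0, -0.1]}))
```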

Creation of a Common Intrinsic Geneset (CIS)

Genes common to both the U133A Genechip Probe Set and the ‘intrinsic’ dataset as defined in Perou et al., (2000) were selected in the following manner: Out of the original ‘intrinsic’ set consisting of 456 cDNA clones, 428 could be assigned to a specific Unigene cluster using the Stanford Source database (Unigene Build 156). This number was then reduced to 403 genes after the removal of duplicate genes. The U133A Genechip probe set was then queried using this list, yielding 292 matches, or 72.5% of the original ‘intrinsic’ set (counting only unique genes).

Results

Partitioning of Normal and Tumour Breast Specimens Using Unsupervised Clustering

The inventors used cDNA microarrays of approximately 13,000 elements to generate gene expression profiles for a set of 26 grossly-dissected breast tissue specimens (14 tumour, 12 normal) obtained from patients of primarily Chinese ethnicity (see Materials and Methods). After hybridization and scanning, approximately 8,000 array elements were found to exhibit fluorescence signals significantly above background levels, and these elements were used for subsequent analysis. Initially, the inventors found that an unsupervised clustering methodology based upon a number of commonly used data filters (e.g. selecting genes exhibiting at least 3-fold regulation across at least 4-5 arrays) (see Perou et al., 1999, Wang et al., 2000) resulted in the array clustergram shown in FIG. 1. Specifically, the sample set segregated into two broad groups, with each group consisting of a mixture of tumour and normal specimens. However, within each group, the inventors found that the tumour and normal tissues effectively segregated into fairly independent sub-branches. The observation that tumour and normal tissues can be segregated using unsupervised clustering suggests that specific genes may exist that can effectively distinguish between a tumour and normal sample. However, in the context of a large unsupervised data set, it is also clear that these genes are only capable of distinguishing between normal and tumour samples in sub-branches of the correlation dendrogram, rather than at the level of a primary class division. Similar findings have also been reported in other breast cancer expression profiling projects (Perou et al., 2000), suggesting that at the level of the global transcriptome, the expression levels of other genes may ‘supersede’ the information encoded by genes involved in the tumour/normal class distinction (see discussion).

Use of Outlier Genesets to Classify Normal and Tumour Samples

One of the main objectives of the inventors' research is to identify genes or gene subsets that are of significant diagnostic or therapeutic potential. To be of clinical utility, it will be necessary to identify a class of genes that can accurately predict if an unknown breast tissue sample is normal or malignant at the level of the primary, rather than secondary, class division. To identify these genesets, or ‘genetic identifiers’, a number of supervised learning strategies, such as neighborhood analysis and artificial neural networks, have been previously described (Golub et al., 1999, Khan et al., 2001). However, the inventors used a slightly different strategy to identify these elements that focuses on the use of highly reproducible outlier genes. In this methodology, samples belonging to different classes are initially established as independent datasets. Within each group, genes that are consistently up or downregulated (‘outliers’) across all or close to all arrays are then identified. These separate ‘outlier groups’ are then combined, and the ability of the combined set of genes to distinguish between the two classes is then assessed using standard clustering methodologies.

The inventors first established outlier gene subsets for both the normal and tumour populations. To avoid biases that might be introduced by fluorophore labelling, they also included in each group 5 ‘reciprocal’ expression profiles in which the sample and reference RNA population were inversely labelled. This analysis identified 60 highly reproducible ‘outlier’ genes for the normal group and 75 genes for the tumour group that were either consistently up or down-regulated across all or close to all arrays (FIG. 2). A cross-comparison of the normal and tumour outlier sets revealed a number of genes in common between both sets (Table 1), leading to a final combined outlier geneset (referred to as the COG) of 108 genes.
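The outlier selection and combination steps can be sketched as below, using the thresholds given in Materials and Methods (greater than 3-fold regulation across 90% of normal arrays or 80% of tumour arrays). The requirement that a gene be regulated in a single direction across the qualifying arrays, and the function names `outlier_genes` and `combined_outlier_geneset`, are assumptions of this sketch.

```python
def outlier_genes(ratios, min_fraction, fold=3.0):
    """Return genes regulated more than `fold` in one direction across at
    least `min_fraction` of the arrays.

    ratios: {gene: [sample/reference expression ratio per array]}
    """
    selected = set()
    for gene, vals in ratios.items():
        up = sum(v > fold for v in vals) / len(vals)
        down = sum(v < 1.0 / fold for v in vals) / len(vals)
        if up >= min_fraction or down >= min_fraction:
            selected.add(gene)
    return selected


def combined_outlier_geneset(normal_ratios, tumour_ratios):
    """Union of the two outlier sets; shared genes are counted once."""
    return (outlier_genes(normal_ratios, 0.90)
            | outlier_genes(tumour_ratios, 0.80))


# Placeholder ratios (not real data): 10 normal arrays, 5 tumour arrays.
normal = {"g1": [4, 5, 3.5, 6, 4.2, 3.1, 5, 4, 4, 4],
          "g2": [1.0] * 10,
          "g3": [0.2] * 10}
tumour = {"g1": [4.0] * 5,
          "g2": [0.1, 0.2, 0.3, 0.25, 1.0],
          "g3": [1.0] * 5}
print(sorted(combined_outlier_geneset(normal, tumour)))
```

Because the combination is a set union, outlier sets of 60 and 75 genes yielding 108 unique genes would correspond to 27 genes shared between the two sets.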

The COG was then used to cluster the 26 breast tissue samples. In contrast to the large-scale clustergram observed in FIG. 1, the inventors found that clustering using the genes found in the COG effectively segregated the majority of tumour and normal samples into two principal branches, with 2 mis-classifications (FIG. 2 b). Specifically, 1 normal sample and 1 tumour sample were mis-assigned, and in the former case a quality check of the gene expression values revealed that this sample was associated with a number of so-called ‘missing’ values (grey bars in clustergram), which may have led to this sample being mis-classified. Nevertheless, the majority of samples were correctly grouped, suggesting that for certain datasets, ‘outlier analysis’ may serve as a simple and effective method to identify discriminating genes between distinct classes.

Definition of a Minimal Genetic Identifier for the Normal vs Tumour Class Distinction in Breast Tissues

Despite representing a dramatic reduction in the number of genes from the initial data set (8,000 to 108), the number of elements contained in the COG is still too large to be feasibly included in its entirety as part of a potential diagnostic assay. Ideally, a diagnostic geneset should i) consist of a minimal number of elements, ii) be of high predictive accuracy, and iii) represent a mixture of genes that are positively and negatively correlated to the class distinction in question. To further reduce the combined outlier geneset to its most informative elements, the inventors used correlation analysis to identify and rank genes in the COG that are most highly correlated to the tumour/normal class distinction (see Materials and Methods). The 10 most highly positively and negatively correlated genes were then assessed in their ability to accurately classify the breast samples. The inventors found that this minimal set of 20 genes, referred to as a ‘genetic identifier’, accurately classified all of the normal and tumour samples (FIG. 3 a and Table 2). The genes that make up the ‘genetic predictor’ represent a mixture of genes known to be involved in breast and tumour biology, as well as other genes whose role in tumour formation has not as yet been described (see discussion).

Predictive Capacity of the 20-gene ‘Genetic Identifier’

All analyses done up to this point were performed on the same ‘training’ set of 26 breast samples, and thus the predictive power of the 20-element geneset has not been addressed. To assess the robustness of this ‘genetic identifier’, the inventors followed the strategy of Golub et al (1999) and tested the ability of the minimal predictor to classify a naïve ‘test set’ of another 22 breast samples, of which 12 samples were tumours and the remaining 10 were non-malignant. In a similar fashion to the training set, they found that the 20-gene genetic identifier was also able to classify the naïve set with complete accuracy (FIG. 3 b). Thus, it appears that the ability of the ‘genetic identifier’ to predict if a given breast sample is normal or malignant is not confined to the training-set from which it was generated. Instead, the number of elements in this geneset, although minimal, may be of sufficient sensitivity and informative power to give it predictive value.

Assessing the Global Level of Variation between Normal and Tumour Breast Tissues

Breast tumours are clinically characterized by wide variations in clinical course, disease aggressiveness, and response to medication. Consistent with these wide phenotypic variations has been the finding that individual breast tumours can exhibit large variations in their global gene expression patterns (Perou et al., 2000). One common hypothesis to explain these wide variations is to consider them as the consequences of multiple independent pathways of tumourigenesis. However, normal breast tissues are also highly environmentally and hormonally sensitive, and the specific state of a normal breast tissue in a particular patient is often dependent upon numerous demographic factors, such as age, menopausal status, and medication history. Thus, it is formally possible that a certain amount of the variation in expression state observed in tumours may also be reflected in non-malignant breast tissue. Since the inventors' data set consists of both normal and malignant samples, they were able to compare the inherent variability of normal and tumour samples to each other. To perform this comparison, they utilized principal component analysis (PCA) on the entire 8,000 gene expression matrix, comprising a total of 22 non-malignant and 26 tumour specimens. Using PCA, the inventors reduced the total gene set to a series of distinct ‘components’, in which each component represents a finite amount of gene expression variation across the primary data set. They hypothesized that observed variation in the data could arise from multiple sources, such as intrinsic biological variation, as well as experimentally introduced variation (such as differences in sample harvesting, hybridization and labelling conditions, etc). However, since the normal and tumour samples were identically harvested, treated and processed in their experiments, variations due to experimental conditions and handling should be equally shared between both groups. Thus, any differences in variation between the tumour and normal groups can most likely be attributed to intrinsic biological variation.

The inventors plotted the amount of variation observed in the normal and tumour data sets against their principal components (FIG. 4). In order to effectively compare the two datasets, each component was normalized to the first component in that dataset, resulting in a graph that depicts how the total variation across the dataset “decays” with each successive principal component. (By convention, the first principal component is usually taken to represent the elements that exhibit maximal variation across the dataset.) The inventors observed that, as a general rule, every component corresponding to the tumour data set consistently exhibited higher variation than the analogous component in the normal data set. These data indicate that the gene expression profiles of normal breast samples are significantly more ‘static’ or ‘unchanging’ when compared to tumour profiles, supporting the hypothesis that the wide variations in gene expression observed in tumours may be a consequence of breast tumours arising from multiple tumourigenic pathways.
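The normalization to the first component can be sketched as below, purely as an illustration. The per-component variances are assumed to have been obtained already from a PCA of each dataset; the numbers are hypothetical and are not data from the study, and `decay_curve` is an illustrative name.

```python
def decay_curve(component_variances):
    """Normalise each component's variance to that of the first component."""
    first = component_variances[0]
    if first <= 0:
        raise ValueError("first component must have positive variance")
    return [v / first for v in component_variances]

# Hypothetical per-component variances (NOT values from the study).
normal_curve = decay_curve([8.0, 2.0, 1.0, 0.4])
tumour_curve = decay_curve([10.0, 6.0, 4.0, 2.5])

# For each component after the first, is the tumour curve above the normal one?
tumour_higher = [t > n for t, n in zip(tumour_curve[1:], normal_curve[1:])]
```

Both curves start at 1.0 by construction, so only the rate of decay over later components is compared.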

Conservation of Molecular Subtypes of Breast Cancer Across Distinct Ethnic Populations

The inventors then used Affymetrix Genechips to profile 56 invasive breast cancers and 6 normal breast tissues that had been isolated from Chinese patients. The raw expression profile scans were subjected to one round of quality control, data filtering and processing (see Materials and Methods), and an unsupervised hierarchical clustering algorithm was used to order the normalized profiles relative to one another on the basis of their transcriptional similarity. Using a dataset of 1796 genes, comprising genes that are both well-measured across at least 70% of all samples and that exhibited considerable transcriptional variation across the samples (as reflected by having a high standard deviation), the inventors observed that the majority of the samples segregated into several discernible groups that could be correlated to specific histopathological parameters. For example, many of the ER+ tumors clustered together ((S1) bar, FIG. 5 a), as did the ERBB2+/ER− samples ((S3) bar). The normal breast samples also clustered as a discernible group whose individual members exhibited very high correlation to one another, suggesting that there is less transcriptional variation in normal breast tissues as compared to tumors. A number of samples, however, were not accurately segregated by the unsupervised clustering algorithm (gray bar). It is possible that such ‘mixed clustering’ results may be attributable to ‘noise’ contributed by non-malignant components in the primary tissue sample, such as normal breast epithelial tissue, lymphocytic infiltrates, and reactive desmoplastic tissue. As previously mentioned, a similar observation was obtained using the cDNA microarray platform, suggesting that this phenomenon is technology-platform independent.

One objective of this study was to determine if the molecular subtypes and associated expression signatures defined in previously published studies were also detectable in a separate patient population. The inventors focused on correlating their expression results to those of Perou et al (2000), a landmark study in which a similar analysis had been performed on a series of breast cancer specimens derived from US and Norwegian patients. Briefly, in that study and a subsequent companion report (Sorlie et al., 2001), the authors determined that invasive breast cancers could be subdivided into at least 5 distinct molecular subtypes based upon an ‘intrinsic’ geneset representing genes whose transcriptional variation is primarily due to the malignant tumor component. The specific expression signatures that represent the ‘hallmark’ elements of each particular subtype are summarized in Table 1 (this dataset is hereafter referred to as the Stanford study). Between the Stanford study and the inventors' work, there are several differences in methodology and experimental design, such as differences in sample handling protocols, patient population, and expression array platform (2-color cDNA microarray in the Stanford study vs 1-color Genechips in the inventors' study, as well as different array probe sequences). The availability of two distinct breast cancer expression datasets from independent institutions (Stanford and the inventors) thus allowed the inventors to test whether, despite these differences, the molecular subtypes defined in one institution's experiments are indeed sufficiently robust to be detectable in another institution's study.

To perform this analysis, the inventors first identified probes on the Affymetrix U133A Genechip corresponding to genes belonging to the ‘intrinsic’ set as defined by the Stanford study (see Materials and Methods). Of 403 unique genes found in the Stanford ‘intrinsic’ set, 292 genes, or 72.5% of the intrinsic set, were also found on the Genechip array. The inventors henceforth refer to this overlapping set of genes as the ‘common intrinsic set’ (CIS). Importantly, the CIS still contains many of the ‘hallmark’ genes whose transcription was reported in the Stanford study to be useful for discriminating between subtypes, and reclustering of the Stanford tumors using the CIS also yielded groupings highly similar to those obtained using the full intrinsic set (data not shown). When the invasive cancers in the inventors' series were reclustered on the basis of the CIS, they observed a striking improvement in the segregation pattern, where now all the cancer samples grouped into highly distinct classes. The inventors then proceeded to compare the molecular subtypes defined in their study to those discovered by the Stanford study (Luminal A, Luminal B/C, Basal, Normal-like, and ERBB2+) (Perou et al., 2000; Sorlie et al., 2001).

Luminal subtypes: All of the cancers in this group were ER+ by conventional immunohistochemistry. The Stanford study defined at least two groups of luminal tumors, Luminal A and Luminal B/C, the latter being associated with a poorer clinical prognosis (Luminal B and C tumors are treated as a single class, as it is reportedly difficult to divide them into two discrete groups (Sorlie et al., 2001)). Consistent with the Stanford study, the inventors also observed the presence of a robust Luminal molecular subtype that was highly similar to the Luminal A subtype of the Stanford study, as this subtype was characterized by high levels of expression of ER and related genes such as GATA3, HNF3a, and X-box Binding Protein 1 (bar (S1)). They could not, however, clearly determine if the Luminal B/C subtypes as defined by the Stanford study were also present in their patient population, based upon the criteria that both the B/C subtypes are associated with intermediate levels of ER-related gene expression, and that the luminal C subtype also expresses high levels of a ‘novel’ gene cluster. The inventors also observed the presence of a second luminal subclass (ER+/ERBB2+) which was distinct from the luminal A cancers in that this other subclass expressed intermediate levels of ER-related genes (similar to Luminal B/C) and genes found in the ‘novel’ cluster (similar to luminal C, bar (S2)). This subclass, however, also expressed high levels of ERBB2-related genes, and is thus likely to be distinct from the luminal C cancers defined by the Stanford study, as luminal C cancers express low levels of the ERBB2 gene cluster. Taken collectively, the inventors' results indicate that Luminal A tumors (“Luminal” in FIG. 5) constitute a robust molecular subtype that can be commonly found across different patient populations. Conversely, the luminal B/C and ER+/ERBB2+ subtypes may represent less robust variants whose presence may be more significantly affected by differences in ethnic specificity, sample handling protocols, or array technology.

As seen in FIG. 5, tumours belonging to the Luminal category (subtype S1) appear to be transcriptionally homogeneous on the basis of the CIS. To determine if tumours belonging to this subtype could be further subdivided, the inventors reclustered a larger group of Luminal tumours using a separate set of genes which in a previous report had been shown to be indicative of a tissue's cellular proliferative status (Sorlie et al., 2001).

On the basis of these “proliferation genes”, they found that the Luminal tumours could be subdivided into two distinct types, namely, “pure” luminal A and another subtype that they have referred to as a Luminal D subtype (FIG. 9 a). It is likely that the Luminal A/D subdivision is clinically meaningful, as a reclustering of a more diverse set of tumours on the basis of the “proliferation genes” resulted in two broad subdivisions, one representing clinically aggressive tumours (Basal, ERBB2 and Luminal D), and the other representing tumours that are more clinically tractable (Luminal, Normal/Normal-like) (FIG. 9 b).

Basal-like: The basal molecular subtype was reported in the Stanford study to be characterized by high levels of two expression signatures: I) markers of the basal mammary epithelia, such as keratin 5 and 17, and II) genes belonging to the ‘novel’ cluster. Consistent with the Stanford study, the inventors also observed a basal subtype associated with similar expression signatures (bar (S4)), indicating that the basal molecular subtype is also highly robust. In addition, however, they detected the apparent presence of another subtype (bar (S5)) that was not associated with any of the expression signatures described in the Stanford study.

Normal Breast-like: The ‘normal-like’ subtype is associated with expression of a gene cluster that is also highly expressed in normal breast tissues, and includes genes such as four and a half LIM domains 1, aquaporin 1, and alcohol dehydrogenase 2 (class I) beta. A number of tumors in the inventors' series also clustered with the normal breast tissues and exhibited this expression signature (bar (S6)). Thus, the ‘normal-like’ molecular subtype can also be considered to be a robust subtype.

ERBB2+: The Stanford study also defined a final ERBB2+ subtype in which the tumors were characterized by high levels of expression of ERBB2-related genes (column E), intermediate levels of expression of the ‘novel’ cluster (column B), and absent expression of ER-related genes (column A). A similar ERBB2+ subtype was also clearly present in the inventors' series (bar (S3)). Consistent with the expression data, they also subsequently confirmed that the tumors belonging to this molecular subtype were all ERBB2+ by conventional immunohistochemistry.

To summarize, of the 5 molecular subtypes defined by the Stanford study, the inventors clearly detected at least 4 subtypes in their own patient population (luminal A, basal-like, normal breast-like, and ERBB2+). They could not clearly determine if one particular subtype (luminal B/C) was present in their series using the genes in the CIS, and they also detected the potential presence of 2 additional subtypes (ER+/ERBB2+ and ER− subtype II) which have not been reported before. The finding that the majority (4/5) of the Stanford molecular subtypes were also clearly detectable in the inventors' study suggests that, despite many methodological differences between centres, molecular subtypes as defined by expression-based genomics are indeed remarkably robust and conserved between different patient populations.

Ductal Carcinoma In Situ (DCIS) Cancers Express The Hallmark Expression Signatures of Invasive Cancer Molecular Subtypes

The previous results indicate that molecularly similar subtypes of breast cancer can indeed occur and be detected across distinct ethnic populations. One limitation of these studies, however, is that it is often very difficult to profile the same cancer over an extended period of time. As such, one question that is often raised is whether these molecular variants represent subtypes that are truly distinct biological entities, or whether they simply reflect a single or a few subtypes in different stages of evolution. Since these two different theories, referred to as the ‘distinct origins’ and the ‘evolutionary’ hypotheses respectively (FIG. 6 e), have different implications for clinical diagnosis and subsequent staging and monitoring, it is important to determine which of these proposed mechanisms is the case for breast cancer. Unfortunately, it is not possible to distinguish between these two models by only studying invasive cancers that have been sampled at a single point in time, as both hypotheses would be expected to produce results similar to those shown in FIG. 5.

In conventional histopathology, ductal carcinoma-in-situ (or DCIS) has long been recognised as the major precursor to invasive breast cancer, and likely represents the earliest morphologically detectable malignant non-invasive breast lesion. Despite their malignant status, however, DCIS cancers are also distinct from invasive cancers in a number of respects. Clinically, DCIS cancers are treated differently from invasive cancers (DCIS cases are primarily treated with surgery with or without adjuvant radiotherapy) (Harris et al., 1997), and DCIS and invasive cancers also differ substantially in their distribution of specific cancer types (Barnes et al., 1992; Tan et al., 2002). Differences such as these raise the possibility that while DCIS cases are malignant, they may also be molecularly distinct in some respects from more advanced invasive cancers. The inventors reasoned that the ‘distinct origins’ and ‘evolutionary’ hypotheses could be tested by profiling a series of DCIS cancers and comparing their profiles to their invasive counterparts. Each hypothesis carries different predictions. If the ‘distinct origins’ hypothesis is true, then the DCIS cancers, representing ‘early’ cancers, should express many, if not all, of the hallmark expression signatures associated with their more mature invasive counterparts. Alternatively, if the ‘evolutionary’ hypothesis is correct, then one might expect the DCIS profiles to be more closely similar to one another than to their invasive counterparts. The inventors obtained 12 DCIS tissue samples whose histopathological status was confirmed by a pathologist using both conventional H & E staining and frozen cryosections of the actual sample that was processed (FIGS. 2 a and b).

Expression profiles of the DCIS samples were then generated and compared to their invasive counterparts. Using the CIS as a starting dataset, the inventors found that the DCIS samples segregated amongst the various invasive cancer samples into distinct categories. Specifically, 5 DCIS samples segregated into the Luminal subtype, 4 into the ER−/ERBB2+ subtype, 2 into the ER+/ERBB2+ subtype, and 1 into the ‘normal breast-like’ subtype. Importantly, within each subtype, each of the DCIS cancers was found to robustly express the hallmark expression signatures of its particular molecular group. Interestingly, no DCIS samples were found to cluster within the basal or ER− subtype II molecular subtypes, which is consistent with previously proposed theories that these subtypes may develop without a (or possess an extremely transient) DCIS component (Barnes et al., 1992). These results suggest that distinct breast cancer molecular subtypes are present even at the DCIS stage of breast cancer tumorigenesis, supporting the hypothesis that the subtypes represent truly distinct biological entities, possibly arising via different tumorigenic pathways (the ‘distinct origins’ hypothesis).

Genes Associated with the Normal/DCIS/Invasive Cancer Transitions Implicate Dysregulation of Wnt Signaling as a Common Early Event in Breast Tumorigenesis and that Luminal A and ERBB2+ Cancers Exhibit Similar Invasion Programs

Mammary tumorigenesis can be broadly divided into two main steps. First, normal breast epithelial tissue is transformed to a malignant state via the concerted deregulation of various cellular pathways (Hahn and Weinberg, 2002). Second, to progress to an invasive cancer, several additional biological subprograms also have to be executed, including penetration of the surrounding basement membrane, invasion of the cancer into the adjacent normal stroma, and angiogenic recruitment of endothelial vessels for tumor nourishment and maintenance (Hanahan and Weinberg, 2000). Given the molecular heterogeneity of breast cancer, one important question in the field is the extent to which the genetic programs that control these two key steps are subtype specific or commonly shared among all breast cancer subtypes.

To identify genes whose expression level was significantly different between normal breast tissues, DCIS cancers, and their invasive counterparts, the inventors used significance analysis of microarrays (SAM), a robust statistical methodology that has been used in previous reports to identify significantly regulated genes (Tusher et al., 2001). They concentrated on studying the luminal and ERBB2+ cancers, as most of the DCIS samples in their study belonged to these two molecular subtypes. First, they tested and confirmed the hypothesis that DCIS cancers, despite expressing many of the hallmarks of invasive cancers, are nevertheless still transcriptionally distinct from invasive cancers. The inventors compared 5 luminal DCIS cancers to 5 luminal invasive cancers, and determined that there existed 222 genes that were significantly regulated using a 2-fold cut-off criterion and a false-discovery rate (FDR) of 5%. In contrast, a control analysis comparing only invasive luminal A cancers which had been randomly distributed into 2 groups failed to identify any significantly regulated genes under these stringent conditions. A similar result was also obtained for DCIS and invasive cancers belonging to the ERBB2+ subtype (data not shown), indicating that significant transcriptional differences exist between DCIS and invasive cancers belonging to both the Luminal A and ERBB2+ subtypes.

SAM was then used to identify genes that were significantly regulated during either the normal/DCIS or the DCIS/invasive transition for both the luminal A and ERBB2 molecular subtypes (FDR=5%). The results are summarized in FIG. 8 a. In total, for the luminal A subtype, a greater number of genes were significantly down-regulated during the normal/DCIS transition than upregulated (705 genes down vs 245 genes up), while for the DCIS/invasive transition more genes were significantly increased in expression than decreased (56 genes down vs 277 genes up). Similarly, for the ERBB2 subtype, 367 genes were significantly downregulated and 275 genes upregulated during the normal/DCIS transition, while 113 genes were down-regulated and 294 genes upregulated during the transition from DCIS to invasive cancer.

The following provides an outline of how the genesets of Tables 4, 5, 6 and 7 were determined.

A “Genetic Identifier” that can Distinguish Between a Normal vs Tumour Breast Sample

Methodology:

Data set: 95 Breast Tissue Samples (11 Normal and 84 Tumors)

Step 1: The data for each sample were normalized by median centering each expression profile around 5000 fluorescence units (the Genechip technology measures expression abundance of each gene in terms of fluorescence units, from 0 to 65535).

Step 2: An intensity filter was applied such that only genes with intensity values in the range of 200 to 100,000 were retained.

Step 3: A ‘Valid value’ filter was applied such that genes that were at least 70% present (ie above a minimum threshold value, usually about 200) in either normals or tumors or both were retained.

Step 4: A statistical t-test was performed to select genes that were differentially expressed in normal vs tumor samples at a confidence level of p<0.00001. This resulted in the selection of 507 genes.

Step 5: Of the 507 genes, a high fold change filter was applied to select genes that exhibited large differences in expression between normal and tumor samples (2.5-fold and above). This resulted in the identification of 49 genes (up in tumors) and 81 genes (up in normals) respectively. These genes are listed in Table 4a.
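Steps 1, 3 and 5 above can be sketched as follows, by way of illustration only. Several details are assumptions not stated in the specification: median centering is implemented as a multiplicative rescaling, ‘present’ is taken to mean strictly above the threshold, and fold change is taken as the ratio of group means. The intensity filter of Step 2 and the t-test of Step 4 are omitted, and all function names are hypothetical.

```python
from statistics import mean, median

def median_center(profile, target=5000.0):
    """Step 1 (assumed multiplicative): rescale so the median equals target."""
    scale = target / median(profile)
    return [value * scale for value in profile]

def fraction_present(values, threshold=200.0):
    """Fraction of samples in which the gene exceeds the threshold."""
    return sum(value > threshold for value in values) / len(values)

def passes_valid_value(normal_vals, tumour_vals, min_fraction=0.7):
    """Step 3: at least 70% present in normals, in tumours, or in both."""
    return (fraction_present(normal_vals) >= min_fraction
            or fraction_present(tumour_vals) >= min_fraction)

def direction(normal_vals, tumour_vals, cutoff=2.5):
    """Step 5 (assumed ratio of means): label genes changing >= cutoff-fold."""
    fold = mean(tumour_vals) / mean(normal_vals)
    if fold >= cutoff:
        return "up in tumours"
    if fold <= 1.0 / cutoff:
        return "up in normals"
    return None  # change too small to pass the filter

# One hypothetical sample profile and one hypothetical gene.
centered = median_center([100.0, 200.0, 400.0])
label = direction([400.0, 500.0, 600.0], [2000.0, 2500.0, 3000.0])
```

In this reading, a gene is carried forward only if it passes the valid-value filter and `direction` returns a label.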

Step 6: The 130 (49 and 81) genes were ranked using support vector machine gene ranking, in the order of their importance in being able to assign an unknown breast sample to either a tumor or normal group. This was done to arrive at a small subset of genes that can accurately distinguish normal from tumor samples. The top 32 genes gave close to 1% misclassification. The results are given in Table 4b.

Step 7: The 32-gene set was tested for its predictive accuracy in the classification of normal vs tumor samples, using leave-one-out cross-validation (LVO CV) testing. No misclassifications were observed.
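A minimal sketch of leave-one-out cross-validation is given below for illustration. The specification does not state which classifier was used in Step 7, so a simple nearest-centroid rule is swapped in here as a stand-in; it is not a support vector machine. The sample values and function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign `sample` to the class whose mean profile is closest.

    Stand-in classifier for illustration only (not the SVM of the text).
    """
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda lab: sq_dist(centroids[lab], sample))

def leave_one_out_errors(samples, labels, predict=nearest_centroid_predict):
    """Hold out each sample in turn, train on the rest, count the errors."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors

# Hypothetical two-gene profiles for 3 normal and 3 tumour samples.
samples = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1],
           [4.0, 1.0], [4.2, 0.8], [3.9, 1.1]]
labels = ["normal"] * 3 + ["tumour"] * 3
errors = leave_one_out_errors(samples, labels)
```

Any classifier with the same `predict(train_x, train_y, sample)` shape could be passed in place of the stand-in.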

Support Vector Machine (SVM) Gene Ranking

This approach is used to rank the genes in a dataset according to their importance in being able to assign an unknown sample to a particular group. Typically, the samples in the dataset are divided into a training set (75%) and a test set (25%). A maximum margin hyperplane separating the two classes (eg ER+ vs ER−) is calculated for the training set.

Assuming ‘m’ genes are present in the set, the equation of the maximum margin hyperplane is

$$H = W_1 G_1 + W_2 G_2 + \dots + W_i G_i + \dots + W_m G_m$$

where the $W_i$ are the weights and the $G_i$ refer to the variables (genes).

Using the genes corresponding to various top ‘N’ weights (the weight is an indicator of the importance of a gene in classification), the class of all samples in the test set is predicted. The prediction rules are built for varying sets of top N genes. The above procedure is repeated 100 times and the gene ranks and misclassification rates are averaged.
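The weight-based ranking and the averaging of ranks over repeated runs can be sketched as below, for illustration only. Fitting the hyperplane itself is not shown: the weight vectors are assumed to be supplied by whatever SVM implementation is used, one per random training/test split. Ranking by absolute weight is also an assumption, and the gene names and weights are hypothetical.

```python
def hyperplane_score(weights, expression):
    """H = W_1*G_1 + ... + W_m*G_m for one sample."""
    return sum(w * g for w, g in zip(weights, expression))

def rank_genes(weights, gene_names):
    """Rank genes by decreasing absolute weight (rank 1 = most important)."""
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    return {gene_names[i]: rank for rank, i in enumerate(order, start=1)}

def average_ranks(weight_runs, gene_names):
    """Average each gene's rank over all runs; return genes best-first."""
    totals = {gene: 0.0 for gene in gene_names}
    for weights in weight_runs:
        for gene, rank in rank_genes(weights, gene_names).items():
            totals[gene] += rank
    n_runs = len(weight_runs)
    return sorted(gene_names, key=lambda gene: totals[gene] / n_runs)

# Hypothetical weights from three repeats (the text describes 100 repeats).
genes = ["g1", "g2", "g3"]
runs = [[0.1, -2.0, 0.5], [0.2, -1.5, 0.9], [0.6, -1.8, 0.4]]
ordered = average_ranks(runs, genes)
top_n = ordered[:2]  # genes used to build a 'top N' prediction rule
```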

“Genetic Identifiers” that can Predict the Estrogen Receptor Status and the ERBB2 Receptor Status of a Breast Tumour Sample

Methodology:

Data set: 55 invasive breast tumor samples. The individual tumors were assigned to the following groups on the basis of IHC (immunohistochemistry):

-   a) Estrogen receptor (ER) status: 35 ER positive and 20 ER negative samples
-   b) c-erbB-2 (ERBB2) status: 21 ERBB2 positive and 34 ERBB2 negative samples.

Step 1: Gene selection to identify genes that are differentially expressed between a) ER+ vs ER− tumors, and b) ERBB2+ vs ERBB2− samples. Three independent gene selection techniques were used:

-   Significance Analysis of Microarrays (SAM), a statistical technique that uses random permutations of the expression data to estimate the ‘false discovery rate’, ie the chance at which a particular gene will be falsely called as being differentially expressed (Tusher et al., 2001). The genes are then ranked by their “relative difference”, which is similar to the ranking used in Step 6, above. The top 100 significant genes were selected.
-   A signal to noise (S2N) strategy was used to rank genes based on their correlation with the class distinction (either ER+/ER− or ERBB2+/ERBB2−) (Golub et al., 1999). The top 100 genes were selected.
-   A support vector machine (SVM) ranking strategy was used to rank the genes according to their importance in assigning a breast tumor sample to the correct class (see below). The optimal gene set (with highest accuracy) was selected.

Step 2: Common Gene Set (CGS): The genes from the 3 independent analyses were pooled, and the genes selected by all three methods were retained. Hence these genes are method-independent and sufficiently robust to be used as a ‘genetic identifier’ to predict either the ER or ERBB2 status of a breast tumor sample.
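The pooling step amounts to taking the intersection of the three selected gene lists, which can be sketched as follows. The gene identifiers below are placeholders and do not correspond to the genes of Table 5; `common_gene_set` is an illustrative name.

```python
def common_gene_set(*gene_lists):
    """Return the genes present in every supplied list, sorted by name."""
    common = set(gene_lists[0])
    for genes in gene_lists[1:]:
        common &= set(genes)  # keep only genes seen in all lists so far
    return sorted(common)

# Placeholder selections from the three methods (hypothetical identifiers).
sam_selected = ["G01", "G02", "G03", "G07"]
s2n_selected = ["G02", "G03", "G05", "G07"]
svm_selected = ["G03", "G07", "G09"]

cgs = common_gene_set(sam_selected, s2n_selected, svm_selected)
```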

Result:

-   For ER classification, the CGS contains 25 unique genes (18 up, 7 down regulated)
-   For ERBB2 classification, the CGS contains 26 unique genes (19 up, 7 down regulated)

The genes belonging to each CGS are listed in Table 5.

Finally, the accuracy of each CGS for tumor classification was assessed using LVO CV testing. The classification algorithm used was a Support Vector Machine (SVM). The average cross validation error rate was 7.286% for ER classification (overall accuracy 92%), and 6.26% for ERBB2 classification (overall accuracy 93%).

“Genetic Identifiers” that can Predict the Molecular Subtype of a Breast Tumour Sample

Methodology

Data set: Expression profiles for tumors belonging to the various subtypes were generated using Affymetrix U133A Genechips. The hallmark expression signatures that characterize each subtype are described above.

-   a) Luminal (19)
-   b) ERBB2 (19)
-   c) Basal (7)
-   d) ER negative type 2 (5)
-   e) Normal and Normal like (12)

A. Identification of a Minimal Geneset for Classification Using a One-vs-All Support Vector Machine Approach

Step 1: The data for each sample were normalized by median centering each expression profile around 1000 fluorescence units (the Genechip technology measures expression abundance of each gene in terms of fluorescence units, from 0 to 65535).

Step 2: A ‘Valid value’ filter was applied such that genes that were at least 70% present (ie above a minimum threshold value, usually about 200) across all samples were chosen.

Step 3: Five different data sets were created by leaving one of the above-mentioned groups out and combining the four remaining groups (ie ‘One-vs-all’).

Dataset   Description
1         Luminal (19) vs Rest (43)
2         ERBB2 (19) vs Rest (43)
3         Basal (7) vs Rest (55)
4         ER negative type 2 (5) vs Rest (57)
5         Normal and Normal like (12) vs Rest (50)
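For illustration, the construction of the five one-vs-all datasets from the group sizes listed above can be sketched as follows. Only the group names and sizes are taken from the text; the function name and the 1/0 label encoding are assumptions.

```python
GROUP_SIZES = {
    "Luminal": 19,
    "ERBB2": 19,
    "Basal": 7,
    "ER negative type 2": 5,
    "Normal and Normal like": 12,
}

def one_vs_all(sample_groups):
    """Map each group to binary labels: 1 for that group, 0 for the rest."""
    datasets = {}
    for target in dict.fromkeys(sample_groups):  # unique groups, in order
        datasets[target] = [1 if g == target else 0 for g in sample_groups]
    return datasets

# One group label per sample, 62 samples in total.
sample_groups = [g for g, n in GROUP_SIZES.items() for _ in range(n)]
datasets = one_vs_all(sample_groups)

# (group size, rest size) for each of the five datasets.
counts = {t: (sum(lab), len(lab) - sum(lab)) for t, lab in datasets.items()}
```

The resulting counts reproduce the ‘vs Rest’ sizes given in the table, for example 19 vs 43 for Luminal.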

Step 4: For each of the 5 datasets, genes were selected that exhibited a minimum 2 fold change between groups (the ratio of means was used to calculate the fold change between two groups).

The results are as follows:

Dataset   Description                                 Differentially regulated (2 fold)
1         Luminal (19) vs Rest (43)                   116
2         ERBB2 (19) vs Rest (43)                     46
3         Basal (7) vs Rest (55)                      318
4         ER negative type 2 (5) vs Rest (57)         309
5         Normal and Normal like (12) vs Rest (50)    188

Step 5: A support vector machine gene ranking analysis was performed for each of the five datasets to rank genes in the order of their importance in assigning an unknown breast sample to its appropriate class (e.g. ER or ERBB2 status, see above).

For datasets 1, 3, 4, and 5, a geneset was selected that yielded a 3% misclassification rate. In the case of dataset 2 (ERBB2 vs rest), the use of all 46 genes gave a minimum error rate of 9.7%. Hence, all 46 were used in the predictor set. The predictor sets are shown in Table 6.

Dataset   Description                                 Differentially regulated (2 fold)   Top ‘N’ genes   Error rate (%)
1         Luminal (19) vs Rest (43)                   116                                 35              3
2         ERBB2 (19) vs Rest (43)                     46                                  46              9.7
3         Basal (7) vs Rest (55)                      318                                 20              3
4         ER negative type 2 (5) vs Rest (57)         294                                 111             3
5         Normal and Normal like (12) vs Rest (50)    188                                 50              3

Step 6: The samples were all combined into one dataset and a one-vs-all cross-validation analysis was carried out using the various predictor sets. 100 independent iterations of 75:25 (training:test) random splits were used, resulting in an overall cross validation error rate of 5.25% (overall accuracy 94%).

B. Identification of a Minimal Geneset for Classification Using a Genetic Algorithm/Maximum Likelihood Discriminant (GA/MLHD) Approach

The GA/MLHD approach is a different classification algorithm (Ooi & Tan, 2003) that serves as an alternative to the one-vs-all (OVA) SVM described in A.

Step 1: Samples were broken down into the following classes:

Class                     No. of samples
ER− subtype II            5
ERBB2+                    19
Normal and Normal-like    12
Luminal                   19
Basal                     7

A truncated dataset of 1000 genes was then established by selecting genes that exhibited the largest standard deviation (SD) across all the samples.
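The truncation by standard deviation can be sketched as follows, by way of illustration. Whether the population or the sample standard deviation was used is not stated; the population form is assumed here, and the toy gene names and values are hypothetical.

```python
from statistics import pstdev

def top_variable_genes(expression, n=1000):
    """Return the n genes with the largest standard deviation across samples.

    expression: dict mapping gene name -> list of values, one per sample
    Assumption: population SD (pstdev); the text does not specify which.
    """
    ranked = sorted(expression, key=lambda g: pstdev(expression[g]),
                    reverse=True)
    return ranked[:n]

# Three hypothetical genes measured across four samples.
expression = {
    "g1": [1.0, 1.0, 1.0, 1.0],  # no variation
    "g2": [1.0, 5.0, 1.0, 5.0],  # SD = 2.0
    "g3": [2.0, 3.0, 2.0, 3.0],  # SD = 0.5
}
selected = top_variable_genes(expression, n=2)
```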

Step 2: 24 runs of the GA/MLHD algorithm were performed on the 62 breast cancer samples based on the class distinction described in Table 4. The accuracy of the predictor sets selected by the GA/MLHD algorithm was assessed by cross-validation and independent test studies.

Details of GA/MLHD Properties:

-   (a) Crossover rates: 0.7, 0.8, 0.9, 1.0
-   (b) Mutation rates: 0.0005, 0.001, 0.002, 0.0025, 0.005, 0.01
-   (c) Uniform crossover
-   (d) Selection: stochastic uniform sampling
-   (e) Predictor set size range: R_(min)=1 and R_(max)=80.
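The roles of the crossover rate and mutation rate listed above can be illustrated with generic genetic-algorithm operators. This sketch is not the GA/MLHD implementation of Ooi & Tan: the chromosome encoding (a fixed-length list of gene indices), the per-position swap probability of 0.5 and the function names are all assumptions, and selection and fitness evaluation are not shown.

```python
import random

def uniform_crossover(parent_a, parent_b, rate, rng):
    """With probability `rate`, swap each position between parents (p=0.5)."""
    child_a, child_b = list(parent_a), list(parent_b)
    if rng.random() < rate:
        for i in range(len(child_a)):
            if rng.random() < 0.5:
                child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def mutate(chromosome, rate, n_genes, rng):
    """Replace each position by a random gene index with probability `rate`."""
    return [rng.randrange(n_genes) if rng.random() < rate else gene
            for gene in chromosome]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
parent_a = [3, 17, 256, 801]      # hypothetical gene indices (0..999)
parent_b = [5, 42, 310, 999]
child_a, child_b = uniform_crossover(parent_a, parent_b, rate=0.8, rng=rng)
mutant = mutate(child_a, rate=0.01, n_genes=1000, rng=rng)
```

Uniform crossover only exchanges material between parents position by position, so each position of the two children always holds the same pair of values as the parents.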

30 optimal predictor sets with sizes ranging from 13 to 17 genes per predictor set were obtained. Each predictor set was associated with a classification accuracy of 1 error out of 62 samples (error rate: 1.61%, overall classification accuracy 98%). 10 out of the 30 predictor sets wrongly classified the Luminal-A sample 980221T as a Normal sample. Of the other 20 predictor sets, 19 misclassified the ERBB2+ sample 990262T as an ER− subtype II sample, while 1 predictor set wrongly classified the same 990262T sample as a Basal-type sample. Two of the optimal predictor sets are displayed in Table 6b.

Identification of a Luminal D Subclass in the Asian Breast Cancer Population

Previous breast cancer expression profiling studies done on primarily Caucasian populations revealed the existence of a ‘luminal’ subtype characterized by the high expression of estrogen-receptor related genes such as ESR1, GATA3, and LIV-1. Further, these ‘luminal’ cancers could be subdivided into at least 2 further subtypes: Luminal A and Luminal B/C. While Luminal A tumors express very high levels of ER-related genes, Luminal B/C cancers express intermediate levels of the ER gene cluster. Furthermore, luminal C tumors also express high levels of a ‘novel’ gene cluster. Luminal B/C tumors were found to exhibit a worse clinical prognosis than Luminal A tumors, arguing that these subtypes are indeed clinically relevant.

A similar study on breast cancers derived from Chinese patients performed in Singapore confirmed that the luminal A subtype is also present in the Asian patient population. However, the luminal B/C subtype was not detected. This difference may be due to methodological differences between the two studies or to true differences in patient population.

A careful inspection of the original Caucasian study by the inventors subsequently revealed that Luminal C tumors are also associated with high levels of a gene cluster whose members are involved in cellular proliferation. In contrast, this ‘proliferation cluster’ is expressed at low levels in Luminal A tumors. The high expression of genes in the ‘proliferation cluster’ may functionally contribute to the worse clinical prognosis associated with Luminal C tumors, as high expression of this cluster is also seen in tumors belonging to the clinically aggressive ERBB2+ and basal (ER−) subtypes. Thus, although a luminal B/C subtype was not observed in the Asian breast cancer population, the inventors hypothesized that the genes in this ‘proliferation’ cluster could also be used to subdivide the previously homogeneous Luminal A tumors found in the Asian population into distinct luminal subtypes.

Results

Identification of ‘Proliferation Cluster’ Linked-Genes on the Affymetrix U133A Genechip

In the inventors' study, the expression profiles of several breast tumors were obtained using commercially available Affymetrix U133A Genechips. Genes corresponding to the original ‘proliferation cluster’ members were then selected from the Genechip. Of the 65 genes comprising the original ‘proliferation cluster’, the inventors determined that 36 (55%) were also present on the Genechip array.

Discovery of a ‘Luminal D’ Subtype in the Asian Luminal Tumor Population

The inventors then used this 36-geneset to recluster a group of tumors which in their previous analysis had been homogeneously assigned to the Luminal A subtype. As seen in FIG. 1, the 36-geneset strikingly divided the tumors into two broad groups characterized by low and high levels of expression of the 36-geneset respectively. The former group is henceforth referred to as the true ‘luminal A’ subtype, while the latter group is referred to as ‘luminal D’, as its expression profile is distinct from previously identified subtypes.
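The low/high division described above can be illustrated with a minimal sketch. It is not the clustering procedure used by the inventors: as a simplifying assumption, each tumor is scored by the mean expression of the geneset and the tumors are split at the largest gap between sorted scores. The gene names, tumor names, and expression values are hypothetical.

```python
from statistics import mean

def geneset_score(profile, geneset):
    """Mean expression of the geneset's genes in one tumor profile (dict: gene -> value)."""
    return mean(profile[g] for g in geneset)

def split_low_high(profiles, geneset):
    """Split tumors into low and high groups at the largest gap between sorted scores.

    This is a stand-in for clustering, chosen only to keep the sketch short.
    """
    scored = sorted((geneset_score(p, geneset), name) for name, p in profiles.items())
    gaps = [scored[i + 1][0] - scored[i][0] for i in range(len(scored) - 1)]
    cut = gaps.index(max(gaps)) + 1
    low = [name for _, name in scored[:cut]]
    high = [name for _, name in scored[cut:]]
    return low, high

# Hypothetical gene names and expression values, for illustration only.
geneset = ["geneP1", "geneP2", "geneP3"]
profiles = {
    "tumor1": {"geneP1": 300, "geneP2": 250, "geneP3": 350},
    "tumor2": {"geneP1": 320, "geneP2": 280, "geneP3": 300},
    "tumor3": {"geneP1": 1500, "geneP2": 1700, "geneP3": 1600},
    "tumor4": {"geneP1": 1400, "geneP2": 1800, "geneP3": 1500},
    "tumor5": {"geneP1": 280, "geneP2": 310, "geneP3": 340},
}

low, high = split_low_high(profiles, geneset)
```

In the terminology of the text, the `low` group would correspond to the true ‘luminal A’ tumors and the `high` group to ‘luminal D’.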

High Levels of Expression of the 36-Geneset Are Also Observed in Other Aggressive Tumor Subtypes

To determine if Luminal D tumors are also more clinically aggressive than Luminal A tumors, the inventors then determined if high expression levels of this cluster were also observed in aggressive tumor subtypes by reclustering a larger series of their tumors using only the 36-gene ‘proliferation cluster’. As seen in FIG. 2, Luminal D tumors intermixed with tumors of the ERBB2+ and Basal subtypes, while Luminal A tumors mixed with the normal and ‘normal-like’ tumors. This result suggests that the Luminal D tumors may share certain hallmarks of more highly aggressive tumors, and that the Luminal D subtype may be clinically relevant.

A ‘Genetic Identifier’ for the Luminal D Subtype

The inventors then proceeded to develop a ‘genetic identifier’ for the Luminal D subtype. In this strategy, the ‘genetic identifier’ should only be applied to a tumor that has previously been characterized as Luminal in nature, for example by the other ‘genetic identifiers’ shown in Tables 5 and 6.

Step 1: A series of expression profiles for 19 tumors which had been previously characterized as Luminal A were normalized by median centering each expression profile around 1000 fluorescence units.

Step 2: A ‘Valid value’ filter was applied such that genes that were at least 70% present (i.e. above a minimum threshold value, usually about 200) across all samples were chosen.

Step 3: To divide the samples in a more robust fashion, a Principal Component Analysis (PCA) was then used to ascertain the Luminal A and D subgroups using the 36 proliferation geneset (FIG. 3).

Step 4: Using the Luminal A (12 samples) vs. Luminal D (7 samples) groupings, genes were selected from the entire expression profile that exhibited a minimum 2 fold change between the two groups (the ratio of means was used to calculate the fold change between the two groups). 111 such genes were identified in this analysis.

Step 5: An SVM gene ranking analysis was then performed for the 111-gene dataset to rank genes in the order of their importance in assigning a luminal breast cancer sample into either the Luminal A or Luminal D subtypes. The top 45 genes gave the lowest error rate (about 12%). 18 genes were up regulated in Luminal D and 27 were down regulated in Luminal D. The genes are depicted in Table 7.

Step 6: The accuracy of the 45-gene Genetic identifier was then assessed using leave one out cross validation. No misclassifications were observed.
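The numerical parts of Steps 1, 2, 4 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: median centering is taken to mean multiplicative scaling of each profile so that its median equals 1000; the PCA of Step 3 and the SVM ranking of Step 5 are omitted; and a simple nearest-centroid classifier (an assumption of this sketch) stands in for the SVM during leave one out cross validation. All expression values, function names, and the `A`/`D` labels are hypothetical.

```python
from statistics import mean, median

def median_center(profile, target=1000.0):
    """Step 1 (assumed form): scale a profile so that its median equals `target`."""
    m = median(profile)
    return [v * target / m for v in profile]

def valid_value_filter(profiles, threshold=200.0, min_fraction=0.7):
    """Step 2: keep gene indices above `threshold` in at least `min_fraction` of samples."""
    n = len(profiles)
    return [g for g in range(len(profiles[0]))
            if sum(p[g] > threshold for p in profiles) / n >= min_fraction]

def fold_change_genes(profiles, labels, genes, min_fold=2.0):
    """Step 4: keep genes whose ratio of group means is at least `min_fold` either way."""
    selected = []
    for g in genes:
        mean_a = mean(p[g] for p, lab in zip(profiles, labels) if lab == "A")
        mean_d = mean(p[g] for p, lab in zip(profiles, labels) if lab == "D")
        ratio = mean_d / mean_a
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            selected.append(g)
    return selected

def nearest_centroid(train, train_labels, sample, genes):
    """Assign `sample` to the class whose mean profile over `genes` is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [p for p, lab in zip(train, train_labels) if lab == label]
        dist = sum((sample[g] - mean(m[g] for m in members)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_errors(profiles, labels, genes):
    """Step 6: leave one out cross validation; returns the number of misclassifications."""
    errors = 0
    for i in range(len(profiles)):
        train = profiles[:i] + profiles[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid(train, train_labels, profiles[i], genes) != labels[i]:
            errors += 1
    return errors

# Hypothetical raw profiles (rows: tumors, columns: five genes), for illustration only.
raw = [
    [1000, 100, 400, 1500, 900],   # labelled Luminal A
    [1100, 120, 450, 1600, 1000],  # labelled Luminal A
    [900, 90, 380, 1400, 850],     # labelled Luminal A
    [1000, 110, 1500, 1450, 900],  # labelled Luminal D
    [1050, 95, 1700, 1550, 950],   # labelled Luminal D
]
labels = ["A", "A", "A", "D", "D"]

normalized = [median_center(p) for p in raw]
kept = valid_value_filter(normalized)
selected = fold_change_genes(normalized, labels, kept)
errors = loocv_errors(normalized, labels, selected)
```

In this toy data the second gene falls below the presence threshold in every sample and is removed by the filter, and only the third gene passes the 2 fold criterion. The sketch evaluates a fixed gene list, mirroring Step 6 as described; re-selecting genes inside each cross validation fold would be a stricter variant.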

Discussion

One outstanding challenge of the post-genomic era is to translate the huge amounts of raw sequence data generated by various genome sequencing projects into applications that improve healthcare and the treatment of disease. One area which could be revolutionised by the availability of these new resources is the field of molecular diagnostics, where the pathologic classification of a tissue, complementing conventional histopathology, is also based upon a set of informative molecular markers. Importantly, one advantage of the molecular approach is that the resolving power of classification schemes based upon molecular data can be sufficiently sensitive to detect clinically relevant disease subtypes that have so far eluded traditional light microscopy approaches (Ash et al., 2000, Bittner et al., 2000).

However, before the potential of molecular diagnostics can be fully realized, a number of challenges must be met and overcome. Firstly, for many common diseases, key informative genes that are able to discriminate between the relevant disease sub-classes in question must be identified. Secondly, in order to be feasibly utilized as part of a clinical assay, these genes must be ‘pared’ down to a minimal set (‘genetic identifiers’) that collectively still delivers high predictive accuracy. Thirdly, because the clinical behaviour of many diseases can vary extensively amongst different ethnic groups and populations, it will be necessary to define appropriate limits of use of these ‘genetic identifiers’ for specific patient populations.

To address these issues, the inventors have embarked upon a large-scale expression profiling project of breast tissues derived from Asian patients. Previous reports have primarily focused on using samples derived from patients of primarily Caucasian origin (Perou et al., 2000, Gruvberger et al., 2001, Hedenfalk et al., 2001), and it is essential to determine if findings obtained from these studies will be applicable to other ethnic populations. This is especially so given the epidemiological and clinical differences in breast cancer between these distinct ethnic groups. In Caucasian populations, the majority of breast cancers tend to occur in post-menopausal women. However, in Singapore and Japan, the absolute number of breast cancer cases per year is roughly one third that of the US, and the incidence of breast cancer in these populations is bi-modal: the first peak, representing the majority of breast cancers, occurs in pre-menopausal women at around the age of 40 (Chia et al., 2000). This first peak is then followed by a second peak at about age 55-60. The earlier incidence of breast cancer in Asian populations is unlikely to be due to earlier detection, as breast cancer screening programs in these countries are still relatively novel compared to Western countries. One possible explanation for these observations is that the breast cancers observed in these groups represent distinct heterogeneous subtypes arising from specific genetic or environmental differences. For example, it is known that the levels of estrogen and progesterone in Chinese women tend to be substantially lower than in Caucasians (Lippman, 1998).

To ensure maximal diversity in the repertoire of expression profiles used in the inventors' analysis, the inventors selected samples derived from patients from a wide variety of demographic and clinical backgrounds, as well as tumours of varying grades and appearances. First, the inventors identified a ‘genetic identifier’ in breast cancer for what is perhaps the most basic distinction of clinical utility, i.e. distinguishing if a given sample is ‘normal’ or ‘malignant’. Although this distinction can currently be made by a qualified pathologist using conventional histopathology, the availability of such a molecular assay would still be of use in clinical settings where rapid diagnosis is required, or when a pathologist may not be readily available. By focusing on highly reproducible ‘outlier’ genes in both normal and tumour datasets, the inventors identified a minimal set of 20 genes that is apparently able to accurately predict if an unknown breast sample is normal or malignant in both a training set and a naïve test set of comparable sample quantity. In addition, using principal component analysis, they were able to show that the expression profiles of normal breast samples appear to be far less varied than their corresponding tumour profiles. In the field of breast cancer research, there are surprisingly few reports in the literature that have directly addressed the question of distinguishing between normal and tumour tissues using the relatively unbiased manner afforded by the DNA microarray approach. In one major study, it was found that the expression profiles of normal breast tissues were sufficiently similar for them to co-segregate with each other using an unsupervised clustering methodology (Perou et al., 2000). However, in that report, the investigators also found that the normal samples, rather than segregating as an independent branch distinct from the tumour samples, instead segregated within a broad tumour class originating from mammary epithelial cells of ‘basal’ or ‘myoepithelial’ origin. This result, most likely due to the similarity of genes that are expressed in normal tissues and tumours of this subclass, illustrates that it may not be trivial to use purely unsupervised methodologies to discriminate between normal and tumour breast tissues. However, while this appears to be an issue for breast cancer genomics, it may not apply to other tissue types. For example, it appears that unsupervised clustering is able to discriminate between normal and malignant colon samples (Alon et al., 1999). One reason for this may be that colon tumours, which primarily arise from disruption of the APC/β-catenin pathway, may be genetically more uniform than breast tumours.

The genes involved in the 20-gene ‘genetic identifier’ belong to many different categories. Genes such as apolipoprotein D are well-known terminal differentiation genes in breast biology, while MAGED2 was previously isolated as a gene that is overexpressed in primary breast tumours, but not in normal mammary tissue or breast cancer cell lines (Kurt et al., 2000). Another gene, ITGA3, which produces the alpha-3 subunit of the alpha-3/beta-1 integrin, has been shown to be associated with mammary tumour metastasis (Morini et al., 2000). The CAV1 protein, which links integrin signaling to the Ras/ERK pathway, has also previously been identified as a potential tumour suppressor (Wary et al., 1998, Wiechen et al., 2001), which may explain its expression in normal breast tissues but not tumours. In addition to genes with known roles in breast and tumour biology, other intriguing genes were identified whose role in tumorigenesis is unclear or not known. For example, thrombin, best known for its role in the coagulation cascade, has recently been shown to inhibit tumour cell growth, which may explain its expression in normal but not tumour breast samples (Huang et al., 2000).

Another example is the human homolog of the S. cerevisiae PWP2 gene, which in yeast plays an essential role in cell growth and separation (Shafaatian et al., 1996).

To gain insights into the diversity of breast cancer molecular subtypes in the Asian population, the inventors then generated and analyzed a series of expression profiles of both invasive breast cancers and DCIS cancers. The aim of this work was to attempt to validate the molecular subtyping scheme defined in the Stanford study using another breast cancer expression dataset. By comparing their expression profiles to previously published studies performed using patient samples of primarily Caucasian origin, they found that the majority of molecular subtypes and hallmark expression signatures were robustly conserved between the two series. Although a similar validation study has recently been reported for prostate cancer (Rhodes et al., 2002), this report is the first time such a comparative analysis has been performed for breast cancer. The conservation of molecular subtypes between the two populations is all the more remarkable when one considers the many methodological differences existing between the studies. For example, one finding of interest was the inventors' ability to detect similar subtypes in both series despite the differences in array technology platform. This result is significant as there is currently conflicting data in the field regarding the feasibility of integrating data from different genomic expression technologies. For example, in Rhodes et al. (2002), it was reported that prostate cancer expression data from spotted cDNA arrays yielded similar data to oligonucleotide arrays.

In contrast, another recent report comparing the expression profiles of cell lines as measured by spotted and oligonucleotide arrays reported a very poor correlation between the studies (Kuo et al., 2002). The inventors' results suggest that data from different technology platforms can indeed be compared, so long as the subtype distinctions in question are fairly robust in nature. The inventors' results also suggest that, despite the epidemiological differences in breast cancer between the Asian and Caucasian populations (see beginning of Discussion), breast cancers in the two ethnic groups are, to a first approximation, highly similar at the molecular level.

The inventors also found that DCIS cancers robustly express many subtype-specific gene expression signatures, suggesting that these molecular subtypes can be discerned even at this pre-invasive stage. Thus, it is unlikely that these subtypes represent an evolving cancer class; rather, they are likely distinct biological entities that may possess different tumorigenic origins. Despite the expression of subtype-specific expression signatures in DCIS cancers (as reported in this study), there is other evidence in the field that DCIS cancers may be distinct from invasive cancers. For example, previous retrospective reports have shown that the majority of low nuclear grade DCIS tumors undergo a long clinical evolution to invasive cancer (Page et al., 1982; Betsill et al., 1978; and Rosen et al., 1980), suggesting that additional genetic events must occur before they become invasive. In addition, histopathological studies have found that there is a considerable difference in the distribution of tumor types in DCIS cancers vs invasive cancers, with ERBB2+ cancers being much more highly represented in DCIS compared to invasive cases (Barnes et al., 1992). It has been unclear, however, if this observation should be interpreted to mean that the ER-ERBB2− cancers lack a DCIS component, or if the ERBB2+ cancers will eventually evolve to an ERBB2− state. The distinctive segregation of the DCIS cancers in the inventors' series suggests that the former is true, since the ERBB2+ cancers already express many ERBB2+ invasive hallmarks.

Finally, by integrating the expression profiles of normal, DCIS, and invasive cancers belonging to the luminal A and ERBB2+ subtypes, the inventors were able to define sets of genes which were regulated in a common and subtype-specific manner during the normal, DCIS, and invasive cancer transitions. Although the results of these analyses clearly need to be supported by further experimental work before any definitive conclusions can be made, there were a number of intriguing observations. The inventors found that a number of components of the Wnt signaling pathway were commonly regulated during the transition from normal to DCIS for both subtypes, implicating deregulation of Wnt signaling as an important common event in breast carcinogenesis. Although previous reports have described the involvement of the Wnt pathway in human breast carcinogenesis (Smalley et al., 2001), it has been less clear if this is an early or late event. The inventors' results suggest the former possibility is more likely.

Secondly, the remarkable commonality of genes regulated from the DCIS to the invasive stage between the two subtypes suggests that many of the genetic processes that underlie cellular invasion, desmoplastic reaction, stromal remodeling etc. may be fairly general and shared across different breast cancer subtypes. Finally, the inventors' results also suggest that both cancer subtypes may be highly metabolically distinctive, with ERBB2+ tumors having a greater reliance on ionic-related processes, while Luminal A tumors may be under a state of chronic metabolic stress. These results are extremely important. For example, the increased metabolic load of Luminal A tumors may explain why ER+ tumors are more radiosensitive than ER− tumors (Villalobos et al., 1996), and calcium signaling may play a role in tumor cell motility controlled by the ERBB2+ receptor (Feldner and Brandt, 2002).

REFERENCES

-   Alon, U., N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine (1999) Broad patterns of gene expression revealed by clustering analysis of tumour and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96,
-   Ash, A. A., M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Truc, Y. Xin, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lisheng, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511
-   Barnes, D. M., J. Bartkova, R. S. Camplejohn, W. J. Gullick, P. J. Smith, and R. R. Millis (1992) Overexpression of c-erbB2 Oncoprotein: Why does this occur more frequently in ductal carcinoma in situ than in invasive mammary carcinoma and is this of prognostic significance? Eur J Cancer 28, 644-648
-   Betsill, W. L. J., P. P. Rosen, P. H. Lieberman, and G. F. Robbins (1978) Intraductal carcinoma. Long-term follow-up after treatment by biopsy alone. JAMA 239, 1863-1867
-   Bittner, M., P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, V. Sondak, N. Hayward, and J. Trent (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536-540
-   Chia, K. S., A. Seow, H. P. Lee, and K. Shanmugaratnam (2000) Cancer Incidence in Singapore, 1993-1997. In (Singapore Cancer Registry)
-   DeRisi, J. L., V. R. Iyer, and P. O. Brown (1997) Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science 278, 680-686
-   Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 95, 14863-14868
-   Feldner, J. C. and B. H. Brandt (2002) Cancer cell motility--on the road from c-erbB-2 receptor steered signaling to actin reorganization. Exp Cell Res 272, 93-108
-   Giuliano, A. E. (1998) Breast. In Current Medical Diagnosis and Treatment, 37, Ed. Tierney, L. M. S. J. McPhee and M. A. Papadakis (Appleton and Lange, Stamford) 666-690
-   Golub, T. R., D. K. Slonim, P. Tamayo, C. Huard, J. P. Gaasenbeek, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander (1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, 531-537
-   Gruvberger, S., M. Ringner, Y. Chen, S. Panavally, L. H. Saal, A. Borg, M. Ferno, C. Peterson, and P. Meltzer (2001) Estrogen Receptor Status in Breast Cancer is Associated with Remarkably Distinct Gene Expression Patterns. Cancer Research 61, 5979-5984
-   Hahn, W. C. and R. A. Weinberg (2002) Rules for making human tumor cells. N Engl J Med 347, 1593-1603
-   Harris, J. R., M. Morrow, and L. Norton (1997) Malignant Tumors of the Breast. In Cancer: Principles and Practice of Oncology, Ed. Devita, V. T. S. Hellman and S. A. Rosenberg (Lippincott-Raven, Philadelphia/N.Y.)
-   Hanahan, D. and R. A. Weinberg (2000) The hallmarks of Cancer. Cell 100, 57-70
-   Hedenfalk, I., D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O. P. Kallioniemi, M. Wilfond, A. Borg, and J. Trent (2001) Gene Expression Profiles in Hereditary Breast Cancer. NEJM 344, 539-548
-   Huang, Y., J. Li, and S. Karpatkin (2000) Thrombin inhibits tumour cell growth in association with up-regulation of p21(waf1/cip1) and Caspases via a p53-independent, STAT-1-dependent pathway. J. Biol. Chem. 275, 6462-6488
-   Khan, J., J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med 7, 673-679
-   Kurt, R. A., W. J. Urba, and D. D. Schoof (2000) Isolation of genes overexpressed in freshly isolated breast cancer specimens. Breast Cancer Res. Treat. 59, 41-48
-   Kuo, W. P., T. K. Jenssen, A. J. Butte, L. O. Machado, and I. S. Kohane (2002) Analysis of measured mRNA measurements from two different microarray technologies. Bioinformatics 18, 405-412
-   Kuukasjarvi, T., J. Kononen, H. Helin, K. Holli, and J. Isola (1996) Loss of estrogen receptor in recurrent breast cancer is associated with poor response to endocrine therapy. J. Clin. Oncol. 14, 2584-2589
-   Lippman (1998) Breast Cancer. In Harrison's Principles of Internal Medicine, 91, Ed. Fauci, A. S. E. Braunwald K. J. Isselbacher J. D. Wilson J. B. Martin D. L. Kasper S. L. Hauser and D. L. Longo (McGraw-Hill, New York) 562-568
-   Morini, M., M. Mottolese, N. Ferrari, G. Ghiorzo, S. Buglioni, R. Mortarini, D. M. Noonan, P. G. Natali, and A. Albini (2000) The alpha-3 beta 1 integrin is associated with mammary carcinoma cell metastasis, invasion, and gelatinase B (MMP-9) activity. Int J Cancer 87, 336-342
-   Ooi, C. H. and P. Tan (2003) Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19, 37-44
-   Page, D., W. Dupont, L. Rogers, and M. Landenberger (1982) Intraductal carcinoma of the breast: follow-up after biopsy only. Cancer 49, 751-758
-   Parl, F. F. (2000) Estrogens, Estrogen Receptor, and Breast Cancer. (IOS Press)
-   Perou, C. M., S. S. Jeffrey, M. van de Rijn, C. A. Rees, M. B. Eisen, D. T. Ross, A. Pergamenschikov, C. F. Williams, S. X. Zhu, J. C. F. Lee, D. Lashkari, D. Shalon, P. O. Brown, and D. Botstein (1999) Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci 96, 9212-9217
-   Perou, C. M., T. Sorlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lonning, A. L. Borresen-Dale, P. O. Brown, and D. Botstein (2000) Molecular Portraits of Human Breast Tumours. Nature 406, 747-752
-   Rhodes, D. R., T. R. Barrette, M. A. Rubin, D. Ghosh, and A. M. Chinnaiyan (2002) Meta-analysis of Microarrays: Interstudy Validation of Gene Expression Profiles Reveals Pathway Dysregulation in Prostate Cancer. Cancer Research 62, 4427-4433
-   Rosen, P., D. Braun, and D. Kinne (1980) The clinical significance of pre-invasive breast carcinoma. Cancer 46, 919-925
-   Shafaatian, R., M. A. Payton, and J. D. Reid (1996) PWP2, a member of the WD-repeat family of proteins, is an essential Saccharomyces cerevisiae gene involved in cell separation. Mol Gen Genet. 252, 101-114
-   Smalley, M. J. and T. C. Dale (2001) Wnt signaling and mammary tumorigenesis. J Mammary Gland Biol Neoplasia 6, 37-52
-   Sorlie, T., C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. E. Lonning, and A. L. Borresen-Dale (2001) Gene Expression Patterns of Breast Carcinomas Distinguish Tumor Subclasses with Clinical Implications. Proc. Natl. Acad. Sci. 98, 10869-10874
-   Tan, P. H., K. L. Chuah, G. Chiang, C. Y. Wong, F. Dong, and B. H. Bay (2002) Correlation of p53 and cerbB2 expression and hormonal receptor status with clinicopathological parameters in ductal carcinoma in situ of the breast. Oncology Reports 9, 1081-1086
-   Tavassoli, F. A. and S. J. Schnitt (1992) Pathology of the Breast. In (Elsevier)
-   Tusher, V. G., R. Tibshirani, and G. Chu (2001) Significance Analysis of Microarrays Applied to the Ionizing Radiation Response. Proc. Natl. Acad. Sci. 98, 5116-5121
-   van't Veer, L. J., H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530-536
-   Villalobos, M., D. Becerra, M. I. Nunez, M. T. Valenzuela, E. Siles, N. Olea, V. Pedraza, and J. M. Ruiz de Almodovar (1996) Radiosensitivity of human breast cancer cell lines of different hormonal responsiveness. Modulatory effects of oestradiol. Int J Radiat Biol 70, 161-169
-   Wang, E., L. D. Miller, G. A. Ohnmacht, E. T. Liu, and F. M. Marincola (2000) High-fidelity mRNA amplification for gene profiling. Nature Biotech. 18, 457-459
-   Wary, K. K., A. Mariotti, C. Zurzolo, and F. G. Giancotti (1998) A requirement for caveolin-1 and associated kinase Fyn in integrin signaling and anchorage-dependent cell growth. Cell 94, 625-634

-   Wiechen, K., L. Diatchenko, A. Agoulnik, K. M. Scharff, H. Schober, K. Arlt, B. Zhumabayeva, P. D. Siebert, M. Dietel, R. Schafer, and C. Sers (2001) Caveolin-1 is down-regulated in human ovarian carcinoma and acts as a candidate tumour suppressor gene. Am J Pathol. 159, 1635-1643

TABLE 1 Common Genes in Both Normal and Tumour Datasets

| NCC ID | Unigene ID | Accession No | GeneName | Annotation |
|---|---|---|---|---|
| 2914401 | Hs.151738 | NM_004994 | MMP9 | matrix metalloproteinase 9 (gelatinase B, 92 kD gelatinase, 92 kD type IV collagenase) |
| 2957001 | Hs.50758 | BF239180 | SMC4L1 | SMC4 (structural maintenance of chromosomes 4, yeast)-like 1 |
| 3080701 | Hs.279009 | BF679062 | MGP | matrix Gla protein |
| 3080801 | Hs.98428 | NM_018952 | HOXB6 | homeo box B6 |
| 3082201 | Hs.211573 | NM_005529 | HSPG2 | heparan sulfate proteoglycan 2 (perlecan) |
| 3085601 | Hs.156110 | AW404507 | IGKC | immunoglobulin kappa constant |
| 3119301 | Hs.78045 | NM_001615 | ACTG2 | actin, gamma 2, smooth muscle, enteric |
| 3174801 | Hs.95972 | BE892678 | SILV | silver (mouse homolog) like |
| 3296301 | Hs.153952 | AW072424 | NT5 | 5′ nucleotidase (CD73) |
| 3390901 | Hs.572 | X02544 | ORM1 | orosomucoid 1 |
| 3401301 | Hs.155421 | AA334619 | AFP | alpha-fetoprotein |
| 3404301 | Hs.25817 | AW195430 | BTBD2 | BTB (POZ) domain containing 2 |
| 3437301 | Hs.78771 | AI525579 | PGK1 | phosphoglycerate kinase 1 |
| 3451301 | Hs.56205 | AW663903 | INSIG1 | insulin induced gene 1 |
| 3610001 | Hs.30743 | AI017284 | PRAME | preferentially expressed antigen in melanoma |
| 3617301 | Hs.10842 | AF052578 | RAN | RAN, member RAS oncogene family |
| 3619101 | Hs.337764 | AB038162 | NA | trefoil factor 1 |
| 3767201 | Hs.274184 | AF207550 | TFE3 | transcription factor binding to IGHM enhancer 3 |
| 3812201 | Hs.914 | X03100 | AGL | Human mRNA for SB classII histocompatibility antigen alpha-chain |
| 3955201 | Hs.19710 | H60423 | SLC17A2 | solute carrier family 17 (sodium phosphate), member 2 |
| 4021001 | Hs.2055 | AA232386 | UBE1 | ubiquitin-activating enzyme E1 |

TABLE 2 Genes found in the minimal breast cancer genetic identifier

| NCC ID | Unigene ID | Accession No | Genename | Annotation | On in Tumour |
|---|---|---|---|---|---|
| 2920901 | Hs.76530 | AU121309 | F2 | coagulation factor II (thrombin) | N |
| 2933601 | Hs.278411 | AB014509 | NCKAP1 | NCK-associated protein 1 | N |
| 2934801 | Hs.79380 | AP001753 | PWP2H | PWP2 homolog | N |
| 2936101 | Hs.1940 | AV733563 | CRYAB | crystallin, alpha B | N |
| 2987501 | Hs.75736 | J02611 | APOD | apolipoprotein D | N |
| 3041201 | Hs.295944 | BG621010 | TFPI2 | tissue factor pathway inhibitor 2 | N |
| 3110601 | Hs.74034 | BG541572 | CAV1 | caveolin 1, caveolae protein, 22 kD | N |
| 3119401 | Hs.184411 | AL558086 | ALB | albumin | N |
| 3143701 | Hs.156346 | NM_001067 | TOP2A | topoisomerase (DNA) II alpha (170 kD) | N |
| 3401301 | Hs.155421 | AA334619 | AFP | alpha-fetoprotein | N |
| 2919801 | Hs.177766 | BE740909 | ADPRT | ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase) | Y |
| 2930501 | Hs.265829 | D01038 | ITGA3 | integrin, alpha 3 (antigen CD49C, alpha 3 subunit of VLA-3 receptor) | Y |
| 2961201 | Hs.4437 | AU131942 | RPL28 | ribosomal protein L28 | Y |
| 3048301 | Hs.4943 | BE891065 | MAGED2 | hepatocellular carcinoma associated protein; breast cancer associated gene 1 | Y |
| 3085601 | Hs.156110 | AW404507 | IGKC | immunoglobulin kappa constant | Y |
| 3119301 | Hs.78045 | NM_001615 | ACTG2 | actin, gamma 2, smooth muscle, enteric | Y |
| 3124401 | Hs.145279 | NM_003011 | SET | SET translocation (myeloid leukemia-associated) | Y |
| 3134101 | Hs.73885 | 088244 | HLA-G | HLA-G histocompatibility antigen, class I, G | Y |
| 3193001 | Hs.84298 | BE741354 | CD74 | CD74 antigen (invariant polypeptide of major histocompatibility complex, class II antigen-associated) | Y |
| 3296401 | Hs.183601 | U70426 | RGS16 | regulator of G-protein signalling 16 | Y |

Genes are ordered according to their correlation to the tumour/normal class distinction.

TABLE 3 Tabulation of expression signatures associated with breast tumor subtypes. Subclasses include Luminal A (L-A), Luminal B (L-B), Luminal C (L-C), Basal (Bas), Normal like (Nor), ERBB2 (ERB). Levels of expression are indicated by H (high expression), I (intermediate expression), and A (absent expression).

| Expression Signature | Unigene | L-A | L-B | L-C | Bas | Nor | ERB |
|---|---|---|---|---|---|---|---|
| Luminal Epithelium | | H | I | I | A | A | A |
| estrogen receptor 1 | Hs.1657 | | | | | | |
| GATA binding protein 3 | Hs.169946 | | | | | | |
| LIV-1 | Hs.79136 | | | | | | |
| Xbox binding protein 1 | Hs.149923 | | | | | | |
| Hepatocyte Nuclear Factor 3 alpha | Hs.299867 | | | | | | |
| Basal Epithelium | | A | A | A | H | H | A |
| Keratin5 | Hs.195850 | | | | | | |
| Keratin17 | Hs.2785 | | | | | | |
| Laminin gamma 2 | Hs.54451 | | | | | | |
| Fatty acid binding protein 7 | Hs.26770 | | | | | | |
| erbb2 related genes | | A | A | A | A | A | H |
| C-ERB-B2 | Hs.323910 | | | | | | |
| GRB7 | Hs.86859 | | | | | | |
| TIAF1 | Hs.75822 | | | | | | |
| TRAF4 | Hs.8375 | | | | | | |
| Normal breast like | | A | A | A | A | H | A |
| CD36 antigen collagen type 1 receptor | Hs.75613 | | | | | | |
| Four and a half LIM domain 1 | Hs.239069 | | | | | | |
| vascular adhesion protein 1 | Hs.198241 | | | | | | |
| alcohol dehydrogenase 2 class 1 | Hs.4 | | | | | | |
| Novel | | A | A | H | H | A | I |
| kinesin-like 5 mitotic kinesin-like protein 1 | Hs.270845 | | | | | | |
| putative integral membrane transporter | Hs.296398 | | | | | | |
| gamma-glutamyl hydrolase conjugase | Hs.78619 | | | | | | |
| squalene epoxidase | Hs.71465 | | | | | | |

TABLE 4a Set of 49 Genes Upregulated in Tumors and 81 Genes Upregulated in Normals

Upregulated in tumors
Probe | Gene Description | UniGene | GenBank | Normal median | Tumor median | Fold change (normal/tumor) | P-value
221730_at | collagen, type V, alpha 2 | Hs.82985 | NM_000393.1 | 2989.34 | 22050.38 | 0.135568639 | 6.53E−08
205483_s_at | interferon-stimulated protein, 15 kDa | Hs.833 | NM_005101.1 | 3440.12 | 19587.87 | 0.175625017 | 2.89E−09
201422_at | interferon, gamma-inducible protein 30 | Hs.14623 | NM_006332.1 | 4216.08 | 22685.34 | 0.185850421 | 5.13E−11
202311_s_at | collagen, type I, alpha 1 | Hs.172928 | NM_000088.1 | 2309.8 | 11583.18 | 0.199409834 | 5.47E−08
214290_s_at | H2A histone family, member O | Hs.795 | AA451996 | 8270.53 | 34668.82 | 0.238558163 | 0.000011
204170_s_at | CDC28 protein kinase 2 | Hs.83758 | NM_001827.1 | 2364.5 | 9307.97 | 0.254029611 | 2.44E−09
204620_s_at | chondroitin sulfate proteoglycan 2 (versican) | Hs.81800 | NM_004385.1 | 8494.23 | 31700.6 | 0.267951711 | 1.64E−10
201261_x_at | biglycan | Hs.821 | BC002416.1 | 3832.74 | 14200.24 | 0.269906706 | 2.96E−10
221731_x_at | chondroitin sulfate proteoglycan 2 (versican) | Hs.81800 | J02814.1 | 10044.24 | 36814.75 | 0.272831949 | 1.97E−09
203936_s_at | matrix metalloproteinase 9 (gelatinase B, 92 kD gelatinase, 92 kD type IV collagenase) | Hs.151738 | NM_004994.1 | 2908.93 | 10635.99 | 0.273498753 | 1.4E−06
213909_at | Homo sapiens cDNA FLJ12280 fis, clone MAMMA1001744 | Hs.288467 | AU147799 | 2270.33 | 8261.75 | 0.274800133 | 2.93E−07
204619_s_at | chondroitin sulfate proteoglycan 2 (versican) | Hs.81800 | BF590263 | 1679.69 | 5982.22 | 0.280780379 | 4.7E−07
213905_x_at | biglycan | Hs.821 | AA845258 | 5025.39 | 17320.39 | 0.290143005 | 6.45E−10
203362_s_at | MAD2 mitotic arrest deficient-like 1 (yeast) | Hs.79078 | NM_002358.2 | 1126.73 | 3794.7 | 0.296922023 | 4.29E−07
209596_at | adlican | Hs.72157 | AF245505.1 | 9872.98 | 31833.51 | 0.310144247 | 9.57E−06
217762_s_at | RAB31, member RAS oncogene family | Hs.223025 | BE789881 | 6239.5 | 20080.05 | 0.310731298 | 8.96E−07
212353_at | sulfatase FP | Hs.70823 | AW043713 | 3298.13 | 10610.47 | 0.310837314 | 2.29E−07
221729_at | collagen, type V, alpha 2 | Hs.82985 | NM_000393.1 | 8089.9 | 25965.7 | 0.311561021 | 1.79E−08
202503_s_at | KIAA0101 gene product | Hs.81892 | NM_014736.1 | 4140.8 | 13277.67 | 0.311861946 | 8.17E−09
200660_at | S100 calcium binding protein A11 (calgizzarin) | Hs.256290 | NM_005620.1 | 19359.81 | 60412.84 | 0.320458532 | 1.37E−08
210046_s_at | isocitrate dehydrogenase 2 (NADP+), mitochondrial | Hs.5337 | U52144.1 | 6598.83 | 20503.1 | 0.321845477 | 2.19E−06
218039_at | nucleolar protein ANKT | Hs.279905 | NM_016359.1 | 2649.43 | 8088.17 | 0.327568535 | 4.71E−08
200838_at | cathepsin B | Hs.297939 | NM_001908.1 | 8903.1 | 26015.64 | 0.342221064 | 5.79E−09
208850_s_at | Thy-1 cell surface antigen | Hs.125359 | AL558479 | 3334.94 | 9742.28 | 0.342316172 | 1.02E−07
215438_x_at | G1 to S phase transition 1 | Hs.2707 | BE906054 | 3749.34 | 10880.78 | 0.344583752 | 2.4E−07
213274_s_at | cathepsin B | Hs.297939 | BE875786 | 5290.88 | 15121.92 | 0.349881497 | 9.49E−10
214352_s_at | v-Ki-ras2 Kirsten rat sarcoma 2 viral oncogene homolog | Hs.351221 | BF673699 | 8905.97 | 25327.68 | 0.351629916 | 4.28E−13
208691_at | transferrin receptor (p90, CD71) | Hs.77356 | BC001188.1 | 10599.34 | 30095.24 | 0.352193237 | 1.63E−06
211161_s_at | collagen, type III, alpha 1 (Ehlers-Danlos syndrome type IV, autosomal dominant) | Hs.119571 | AF130082.1 | 16874.98 | 47522.98 | 0.355090948 | 4.8E−07
200887_s_at | signal transducer and activator of transcription 1, 91 kD | Hs.21486 | NM_007315.1 | 11865.1 | 33057.82 | 0.358919614 | 2.31E−07
222077_s_at | Rac GTPase activating protein 1 | Hs.23900 | AU153848 | 2198.49 | 6100.35 | 0.360387519 | 1.65E−08
212057_at | KIAA0182 protein | Hs.75909 | D80004.1 | 5085.42 | 14109.59 | 0.360422946 | 9.01E−06
222039_at | hypothetical protein FLJ11029 | Hs.274448 | AA292789 | 985.61 | 2733.2 | 0.360806615 | 6.79E−06
202391_at | brain abundant, membrane attached signal protein 1 | Hs.79516 | NM_006317.1 | 6613.73 | 18202.02 | 0.36335143 | 1.85E−06
222158_s_at | CGI-146 protein | Hs.42409 | AF229834.1 | 2670.29 | 7278.07 | 0.366895345 | 1.63E−06
214435_x_at | v-ral simian leukemia viral oncogene homolog A (ras related) | Hs.288757 | NM_005402.1 | 1882.24 | 5097.71 | 0.369232459 | 2.9E−09
208998_at | uncoupling protein 2 (mitochondrial, proton carrier) | Hs.80658 | U94592.1 | 10979.98 | 29619.79 | 0.370697429 | 2.5E−08
205436_s_at | H2A histone family, member X | Hs.147097 | NM_002105.1 | 4050.78 | 10910.21 | 0.371283413 | 2.31E−08
209218_at | squalene epoxidase | Hs.71465 | AF098865.1 | 4862.95 | 12883.73 | 0.377448922 | 2.68E−06
219148_at | T-LAK cell-originated protein kinase | Hs.104741 | NM_018492.1 | 783.67 | 2061.19 | 0.380202698 | 1.27E−05
214710_s_at | cyclin B1 | Hs.23960 | BE407516 | 1750.12 | 4576.64 | 0.382402811 | 1.41E−06
202736_s_at | U6 snRNA-associated Sm-like protein | Hs.76719 | NM_012321.1 | 3258.86 | 8432.11 | 0.38648215 | 7.8E−07
201954_at | actin related protein 2/3 complex, subunit 1B (41 kD) | Hs.11538 | NM_005720.1 | 5792.32 | 14857.02 | 0.389870916 | 1.98E−09
AFFX-HUMISGF3A/M97935_3_at | signal transducer and activator of transcription 1, 91 kD | Hs.21486 | M97935 | 8912.27 | 22688.41 | 0.392811572 | 7.83E−08
202954_at | ubiquitin-conjugating enzyme E2C | Hs.93002 | NM_007019.1 | 3982.35 | 10133.97 | 0.392970376 | 1.13E−06
209945_s_at | glycogen synthase kinase 3 beta | Hs.78802 | BC000251.1 | 2414.33 | 6121.16 | 0.394423606 | 4.26E−08
213553_x_at | apolipoprotein C-I | Hs.268571 | W79394 | 6342.73 | 15981.27 | 0.396885229 | 6.13E−06
210004_at | oxidised low density lipoprotein (lectin-like) receptor 1 | Hs.77729 | AF035776.1 | 929.49 | 2322.52 | 0.400207533 | 9.33E−06
208091_s_at | hypothetical protein DKFZp564K0822 | Hs.4750 | NM_030796.1 | 7908.33 | 19735.4 | 0.400717999 | 4.32E−09

Upregulated in normals
Probe | Gene Description | UniGene | GenBank | Normal median | Tumor median | Fold change (normal/tumor) | P-value
202037_s_at | secreted frizzled-related protein 1 | Hs.7306 | NM_003012.2 | 59365.66 | 5359.35 | 11.07702613 | 7.16E−11
212730_at | KIAA0353 protein | Hs.10587 | AK026420.1 | 46331.26 | 4401.76 | 10.52562157 | 1.72E−12
205051_s_at | v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog | Hs.81665 | NM_000222.1 | 30870.31 | 3453.96 | 8.937657066 | 1.28E−11
203881_s_at | dystrophin (muscular dystrophy, Duchenne and Becker types) | Hs.169470 | NM_004010.1 | 9702.27 | 1267.79 | 7.652899928 | 5.88E−17
209292_at | inhibitor of DNA binding 4, dominant negative helix-loop-helix protein | Hs.34853 | NM_001546.1 | 6037.09 | 864.39 | 6.984220086 | 8.13E−11
209291_at | inhibitor of DNA binding 4, dominant negative helix-loop-helix protein | Hs.34853 | NM_001546.1 | 19487.35 | 2908.02 | 6.701243458 | 7.26E−09
202035_s_at | secreted frizzled-related protein 1 | Hs.7306 | AI332407 | 8226.47 | 1233.99 | 6.666581317 | 1.2E−05
206825_at | oxytocin receptor | Hs.2820 | NM_000916.2 | 14315.07 | 2188.79 | 6.540175165 | 2.48E−15
218706_s_at | hypothetical protein FLJ21313 | Hs.235445 | AW575493 | 15578.77 | 2719.59 | 5.728352435 | 1.21E−13
202350_s_at | matrilin 2 | Hs.19368 | NM_002380.2 | 11301.25 | 2099.9 | 5.381803895 | 2.25E−07
211737_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | BC005916.1 | 19118.74 | 3681.29 | 5.193489239 | 1.98E−09
209863_s_at | tumor protein p63 | Hs.137569 | AF091627.1 | 15557.74 | 3073.13 | 5.062506305 | 5.23E−12
218087_s_at | SH3-domain protein 5 (ponsin) | Hs.108924 | NM_015385.1 | 7983.63 | 1692.15 | 4.718039181 | 1.17E−12
219795_at | solute carrier family 6 (neuro-transmitter transporter), member 14 | Hs.162211 | NM_007231.1 | 3443.96 | 767.46 | 4.487478175 | 3.52E−06
202342_s_at | tripartite motif-containing 2 | Hs.12372 | NM_015271.1 | 8892.84 | 2088.2 | 4.258615075 | 5.46E−07
209290_s_at | nuclear factor I/B | Hs.33287 | BC001283.1 | 51664.48 | 12407.42 | 4.16399864 | 3.45E−06
213029_at | Homo sapiens mRNA; cDNA DKFZp564H1916 (from clone DKFZp564H1916) | Hs.326416 | AL110126.1 | 31908.67 | 7680.26 | 4.154634088 | 1.19E−10
203706_s_at | frizzled homolog 7 (Drosophila) | Hs.173859 | NM_003507.1 | 19052.38 | 4610.75 | 4.132165049 | 3.3E−07
209392_at | ectonucleotide pyrophosphatase/phosphodiesterase 2 (autotaxin) | Hs.174185 | L35594.1 | 12733.37 | 3091.99 | 4.118179554 | 9.92E−10
214598_at | claudin 8 | Hs.162209 | AL049977.1 | 8208.2 | 1993.78 | 4.11690357 | 7.3E−07
203065_s_at | caveolin 1, caveolae protein, 22 kD | Hs.74034 | NM_001753.2 | 15611.14 | 3827.36 | 4.078827181 | 1.67E−12
204731_at | transforming growth factor, beta receptor III (betaglycan, 300 kD) | Hs.342874 | NM_003243.1 | 12204.26 | 3072.8 | 3.971706587 | 5.14E−06
218330_s_at | retinoic acid inducible in neuroblastoma | Hs.23467 | NM_018162.1 | 12668.28 | 3289.49 | 3.851138018 | 2.24E−08
203323_at | caveolin 2 | Hs.139851 | BF197655 | 11789.6 | 3069.88 | 3.8404107 | 1E−15
218804_at | hypothetical protein FLJ10261 | Hs.26176 | NM_018043.1 | 12822.63 | 3377.19 | 3.796834054 | 1.74E−06
206481_s_at | LIM domain binding 2 | Hs.4980 | NM_001290.1 | 7116.81 | 1895.62 | 3.754344225 | 1.03E−09
208370_s_at | Down syndrome critical region gene 1 | Hs.184222 | NM_004414.2 | 21019.72 | 5602.52 | 3.751833104 | 7.5E−07
211726_s_at | flavin containing monooxygenase 2 | Hs.132821 | BC005894.1 | 17812.59 | 4796.43 | 3.713718328 | 3.49E−08
201012_at | annexin A1 | Hs.78225 | NM_000700.1 | 41241.85 | 11106.89 | 3.713177136 | 3.91E−10
212097_at | caveolin 1, caveolae protein, 22 kD | Hs.74034 | AU147399 | 23596.76 | 6367.19 | 3.705992753 | 3.08E−15
209170_s_at | glycoprotein M6B | Hs.5422 | AF016004.1 | 8790.1 | 2373.92 | 3.702778527 | 2.01E−07
209160_at | aldo-keto reductase family 1, member C3 (3-alpha hydroxysteroid dehydrogenase, type II) | Hs.78183 | AB018580.1 | 6068.7 | 1643.09 | 3.693467795 | 2.12E−07
202746_at | Integral membrane protein 2A | Hs.17109 | AL021786 | 14250.79 | 3939.27 | 3.617622047 | 2.69E−10
209894_at | leptin receptor | Hs.226627 | U50748.1 | 3660.94 | 1016.43 | 3.601763033 | 5.5E−11
203324_s_at | caveolin 2 | Hs.139851 | NM_001233.1 | 6068.91 | 1715.26 | 3.538186631 | 2.97E−10
204719_at | ATP-binding cassette, sub-family A (ABC1), member 8 | Hs.38095 | NM_007168.1 | 4833.57 | 1388.04 | 3.482298781 | 5.56E−08
203549_s_at | lipoprotein lipase | Hs.180878 | NM_000237.1 | 10789.01 | 3131.46 | 3.44536095 | 9.05E−11
206115_at | early growth response 3 | Hs.74088 | NM_004430.1 | 12017.1 | 3516.09 | 3.41774528 | 5.81E−06
219935_at | a disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 5 (aggrecanase-2) | Hs.58324 | NM_007038.1 | 8376.24 | 2753.5 | 3.405207917 | 3.35E−12
201656_at | integrin, alpha 6 | Hs.227730 | NM_000210.1 | 9626.26 | 2893.95 | 3.326339432 | 4.04E−07
205463_s_at | platelet-derived growth factor alpha polypeptide | Hs.37040 | NM_002607.1 | 8648.24 | 2619.44 | 3.301560639 | 3.12E−12
823_at | small inducible cytokine subfamily D (Cys-X3-Cys), member 1 (fractalkine, neurotactin) | Hs.80420 | U84487 | 12990.21 | 3946.33 | 3.291719142 | 8.6E−07
213032_at | Homo sapiens mRNA; cDNA DKFZp564H1916 (from clone DKFZp564H1916) | Hs.326416 | AL110126.1 | 12729.9 | 3880.97 | 3.280082041 | 8.56E−06
217047_s_at | KIAA0914 gene product | Hs.177664 | AK027138.1 | 9278.12 | 2871.79 | 3.230779409 | 5.28E−09
209465_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | AL565812 | 7512.2 | 2334.46 | 3.217960471 | 7.53E−08
207808_s_at | protein S (alpha) | Hs.64016 | NM_000313.1 | 5027.75 | 1573.15 | 3.195976226 | 1.7E−09
209289_at | nuclear factor I/B | Hs.33287 | AI700518 | 43037.8 | 13478.56 | 3.193056232 | 3.62E−06
209185_s_at | insulin receptor substrate 2 | Hs.143648 | AF073310.1 | 19990.69 | 6334.2 | 3.155992864 | 1.39E−06
202552_s_at | cysteine-rich motor neuron 1 | Hs.19280 | NM_016441.1 | 8386.55 | 2721.46 | 3.081636328 | 8.31E−09
203688_at | polycystic kidney disease 2 (autosomal dominant) | Hs.82001 | NM_000297.1 | 7543.97 | 2462.41 | 3.063653088 | 3.73E−10
222162_s_at | a disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 1 | Hs.8230 | AK023795.1 | 10496.22 | 3485.94 | 3.01101568 | 3.81E−06
211685_s_at | neurocalcin delta | Hs.90063 | AF251061.1 | 9352.32 | 3133.91 | 2.984233753 | 1.78E−08
213900_at | Friedreich ataxia region gene X123 | Hs.77889 | AA524029 | 11954.68 | 4037.3 | 2.961058133 | 1.26E−11
222372_at | ESTs Weakly similar to ALU1_HUMAN ALU SUBFAMILY J SEQUENCE CONTAMINATION WARNING ENTRY [H. sapiens] | Hs.291289 | AW971248 | 8049.26 | 2718.48 | 2.960941408 | 4.62E−06
201540_at | four and a half LIM domains 1 | Hs.239069 | NM_001449.1 | 17627.89 | 6015.25 | 2.930533228 | 4.28E−08
212254_s_at | bullous pemphigoid antigen 1 (230/240 kD) | Hs.198689 | BG253119 | 19972.78 | 6991.03 | 2.856915219 | 1.32E−09
213353_at | ATP-binding cassette, sub-family A (ABC1), member 5 | Hs.180513 | BF693921 | 5730.62 | 2019.34 | 2.837867818 | 3.71E−10
205498_at | growth hormone receptor | Hs.125180 | NM_000163.1 | 7384.79 | 2603.42 | 2.836572662 | 4.63E−06
215016_x_at | bullous pemphigoid antigen 1 (230/240 kD) | Hs.198689 | BC004912.1 | 19089.82 | 6747.39 | 2.829215445 | 3.72E−09
208944_at | transforming growth factor, beta receptor II (70-80 kD) | Hs.82028 | D50683.1 | 18938.86 | 6698.52 | 2.827320065 | 7.59E−12
210839_s_at | ectonucleotide pyrophosphatase/phosphodiesterase 2 (autotaxin) | Hs.174185 | D45421.1 | 7024.74 | 2493.07 | 2.817706683 | 4.26E−13
218901_at | phospholipid scramblase 4 | Hs.182538 | NM_020353.1 | 8923.62 | 3169.64 | 2.815341805 | 1.56E−10
209466_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | M57399.1 | 18099.82 | 6464.73 | 2.799779728 | 4.27E−08
200795_at | SPARC-like 1 (mast9, hevin) | Hs.75445 | NM_004684.1 | 62309.15 | 22325.59 | 2.790929601 | 4.78E−07
202973_x_at | KIAA0914 gene product | Hs.177664 | NM_014883.1 | 11301.89 | 4053.46 | 2.788208099 | 4.1E−07
218723_s_at | RGC32 protein | Hs.76640 | NM_014059.1 | 13133.05 | 4722.25 | 2.781100111 | 2.13E−07
213375_s_at | hypothetical gene CG018 | Hs.22174 | N80918 | 9894.2 | 3571.88 | 2.770025869 | 2.77E−09
221841_s_at | Kruppel-like factor 4 (gut) | Hs.356370 | BF514078 | 17464.66 | 6347.92 | 2.751241351 | 1.3E−06
218276_s_at | WW45 protein | Hs.288906 | NM_021818.1 | 6994.97 | 2552.32 | 2.740832052 | 4.14E−09
212463_at | Homo sapiens mRNA; cDNA DKFZp564J0323 (from clone DKFZp564J0323) | Hs.99766 | BE379006 | 23386.73 | 8711.13 | 2.684695327 | 2.02E−08
213486_at | hypothetical protein DKFZp761N09121 | Hs.6421 | BF435376 | 4412.93 | 1649.6 | 2.675151552 | 2.78E−14
206306_at | ryanodine receptor 3 | Hs.9349 | NM_001036.1 | 2449.43 | 926.73 | 2.643089141 | 3.38E−09
212675_s_at | KIAA0582 protein | Hs.79507 | AB011154.1 | 6645.48 | 2532.1 | 2.624493503 | 4.88E−12
200762_at | dihydropyrimidinase-like 2 | Hs.173381 | NM_001386.1 | 24509.97 | 9355.96 | 2.619717271 | 1.4E−08
207480_s_at | Meis1, myeloid ecotropic viral integration site 1 homolog 2 (mouse) | Hs.104105 | NM_020149.1 | 5180.76 | 2010.23 | 2.577197634 | 2.37E−07
219091_s_at | EMILIN-like protein EndoGlyx-1 | Hs.127216 | NM_024756.1 | 6277.33 | 2442.04 | 2.5705271 | 4.58E−13
219304_s_at | spinal cord-derived growth factor-B | Hs.112885 | NM_025208.1 | 10905.82 | 4319.06 | 2.525044801 | 9.33E−10
207542_s_at | aquaporin 1 (channel-forming integral protein, 28 kD) | Hs.74602 | NM_000385.2 | 8557.32 | 3405.56 | 2.512749739 | 8.69E−07
211998_at | H3 histone, family 3B (H3.3B) | Hs.180877 | NM_005324.1 | 10030.86 | 3995.83 | 2.510332021 | 8.65E−06
204115_at | guanine nucleotide binding protein 11 | Hs.83381 | NM_004126.1 | 5852.14 | 2337.15 | 2.50396423 | 2.41E−07
202016_at | mesoderm specific transcript homolog (mouse) | Hs.70284 | NM_002402.1 | 21998.29 | 8805.67 | 2.498196049 | 1.05E−07

Probe = Affymetrix Probe Sequence
Description = Gene name and annotation
Unigene = Unigene Number (NCBI)
Genbank = Genbank Accession Number
Median = Median expression value in Normals or Tumors
Fold change = Ratio of expression values (normals/tumors)
P-value = t-test significance
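The Table 4a legend defines fold change as the ratio of median expression values (normals/tumors). A minimal sketch of that arithmetic follows, using the medians of two probes listed in the table. The function name `fold_change`, the direction labels, and the threshold of 1 are illustrative assumptions. The t-test P-values are not recomputed here, because the per-sample expression values they were derived from are not given in the table.

```python
# Sketch of the fold-change calculation defined in the Table 4a legend:
# fold change = median expression in normals / median expression in tumors.
# Function name and direction labels are assumptions for illustration.

def fold_change(normal_median, tumor_median):
    """Ratio of median expression values (normals/tumors)."""
    return normal_median / tumor_median


def direction(fc):
    """Label a probe by which group shows the higher median expression."""
    return "upregulated in tumors" if fc < 1.0 else "upregulated in normals"


# (normal median, tumor median) copied from two rows of Table 4a.
EXAMPLE_ROWS = {
    "221730_at": (2989.34, 22050.38),    # collagen, type V, alpha 2
    "202037_s_at": (59365.66, 5359.35),  # secreted frizzled-related protein 1
}

if __name__ == "__main__":
    for probe, (normal, tumor) in EXAMPLE_ROWS.items():
        fc = fold_change(normal, tumor)
        print(f"{probe}: fold change {fc:.9f} ({direction(fc)})")
```

A ratio below 1 therefore corresponds to the "Upregulated in tumors" half of the table and a ratio above 1 to the "Upregulated in normals" half.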

TABLE 4b Minimal Geneset for the Classification of Normal vs Tumor

Upregulated in Tumors
Probe | Gene Description | UniGene | GenBank
201954_at | actin related protein 2/3 complex, subunit 1B (41 kD) | Hs.11538 | NM_005720.1
213905_x_at | biglycan | Hs.821 | AA845258
201261_x_at | biglycan | Hs.821 | BC002416.1
202391_at | brain abundant, membrane attached signal protein 1 | Hs.79516 | NM_006317.1
205483_s_at | interferon-stimulated protein, 15 kDa | Hs.833 | NM_005101.1
221729_at | collagen, type V, alpha 2 | Hs.82985 | NM_000393.1
211161_s_at | collagen, type III, alpha 1 (Ehlers-Danlos syndrome type IV, autosomal dominant) | Hs.119571 | AF130082.1
201422_at | interferon, gamma-inducible protein 30 | Hs.14623 | NM_008332.1
203936_s_at | matrix metalloproteinase 9 (gelatinase B, 92 kD gelatinase, 92 kD type IV collagenase) | Hs.151738 | NM_004994.1
210004_at | oxidised low density lipoprotein (lectin-like) receptor 1 | Hs.77729 | AF035776.1
208998_at | uncoupling protein 2 (mitochondrial, proton carrier) | Hs.80658 | U94592.1
222039_at | hypothetical protein FLJ11029 | Hs.274448 | AA292789

Upregulated in Normals
Probe | Gene Description | UniGene | GenBank
209160_at | aldo-keto reductase family 1, member C3 (3-alpha hydroxysteroid dehydrogenase, type II) | Hs.78183 | AB018580.1
201012_at | annexin A1 | Hs.78225 | NM_000700.1
204719_at | ATP-binding cassette, sub-family A (ABC1), member 8 | Hs.38095 | NM_007168.1
221841_s_at | Kruppel-like factor 4 (gut) | Hs.356370 | BF514079
210839_s_at | ectonucleotide pyrophosphatase/phosphodiesterase 2 (autotaxin) | Hs.174185 | D45421.1
209392_at | ectonucleotide pyrophosphatase/phosphodiesterase 2 (autotaxin) | Hs.174185 | L35594.1
201540_at | four and a half LIM domains 1 | Hs.239069 | NM_001449.1
202342_s_at | tripartite motif-containing 2 | Hs.12372 | NM_015271.1
209185_s_at | insulin receptor substrate 2 | Hs.143648 | AF073310.1
209894_at | leptin receptor | Hs.226627 | U50748.1
206481_s_at | LIM domain binding 2 | Hs.4980 | NM_001290.1
202016_at | mesoderm specific transcript homolog (mouse) | Hs.79284 | NM_002402.1
209290_s_at | nuclear factor I/B | Hs.33287 | BC001283.1
218901_at | phospholipid scramblase 4 | Hs.182538 | NM_020353.1
209466_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | M57399.1
211737_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | BC005916.1
202037_s_at | secreted frizzled-related protein 1 | Hs.7306 | NM_003012.2
205051_s_at | v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog | Hs.81665 | NM_000222.1
212730_at | KIAA0353 protein | Hs.10587 | AK026420.1
218330_s_at | retinoic acid inducible in neuroblastoma | Hs.23467 | NM_018162.1

TABLE 5A CGS for ER and ERBB2 Classification

ER Classification Genes
Probe | Gene Name | Unigene | GenBank | Regulation
205225_at | estrogen receptor 1 | Hs.1657 | NM_000125.1 | +
203963_at | carbonic anhydrase XII | Hs.5338 | NM_001218.2 | +
209602_s_at | GATA binding protein 3 | Hs.169946 | AI796169 | +
214164_x_at | adaptor-related protein complex 1, gamma 1 subunit | Hs.5344 | BF752277 | +
202089_s_at | LIV-1 protein, estrogen regulated | Hs.79136 | NM_012319.2 | +
212956_at | KIAA0882 protein | Hs.90419 | AB020689.1 | +
214440_at | N-acetyltransferase 1 (arylamine N-acetyltransferase) | Hs.165956 | NM_000662.1 | +
206754_s_at | cytochrome P450, subfamily IIB (phenobarbital-inducible), polypeptide 6 | Hs.1360 | NM_000767.2 | +
222212_s_at | LAG1 longevity assurance homolog 2 (S. cerevisiae) | Hs.285976 | AK001105.1 | +
218195_at | hypothetical protein FLJ12910 | Hs.15929 | NM_024573.1 | +
205862_at | KIAA0575 gene product | Hs.193914 | NM_014668.1 | +
212195_at | Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053) | Hs.71968 | AL049265.1 | +
208682_s_at | melanoma antigen, family D, 2 | Hs.4943 | AF126181.1 | +
202342_s_at | tripartite motif-containing 2 | Hs.12372 | NM_015271.1 | −
209459_s_at | NPD009 protein | Hs.283675 | AF237813.1 | +
201037_at | phosphofructokinase, platelet | Hs.99910 | NM_002627.1 | −
203571_s_at | adipose specific 2 | Hs.74120 | NM_006829.1 | +
214088_s_at | fucosyltransferase 3 (galactoside 3(4)-L-fucosyltransferase, Lewis blood group included) | Hs.169238 | AW080549 | −
201976_s_at | myosin X | Hs.61638 | NM_012334.1 | −
218502_s_at | trichorhinophalangeal syndrome I | Hs.26102 | NM_014112.1 | +
203221_at | transducin-like enhancer of split 1 (E(sp1) homolog, Drosophila) | Hs.28935 | AI951720 | −
207002_s_at | pleiomorphic adenoma gene-like 1 | Hs.75825 | NM_002656.1 | −
207030_s_at | cysteine and glycine-rich protein 2 | Hs.10526 | NM_001321.1 | −
204623_at | trefoil factor 3 (intestinal) | Hs.352107 | NM_003226.1 | +
205009_at | trefoil factor 1 (breast cancer, estrogen-inducible sequence expressed in) | Hs.350470 | NM_003225.1 | +

Regulation = On (+) or Off (−) in an ER+ tumor

TABLE 5B ERBB2 Classification Genes

Probe | Gene Name | Unigene | GenBank | Regulation
216836_s_at | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian) | Hs.323910 | X03363.1 | +
210761_s_at | growth factor receptor-bound protein 7 | Hs.86859 | AB008790.1 | +
202991_at | steroidogenic acute regulatory protein related | Hs.77628 | NM_006804.1 | +
55616_at | hypothetical gene MGC9753 | Hs.91668 | AI703342 | +
214203_s_at | proline dehydrogenase (oxidase) 1 | Hs.343874 | AA074145 | +
213557_at | KIAA0904 protein | Hs.278346 | AW305119 | +
220149_at | hypothetical protein FLJ22671 | Hs.193745 | NM_024861.1 | +
215659_at | Homo sapiens cDNA: FLJ21521 fis, clone COL0588O | Hs.306777 | AK025174.1 | +
219233_s_at | hypothetical protein PRO2521 | Hs.19054 | NM_018530.1 | +
203497_at | PPAR binding protein | Hs.15589 | NM_004774.1 | +
219226_at | CDC2-related protein kinase 7 | Hs.123073 | NM_016507.1 | +
202712_s_at | creatine kinase, mitochondrial 1 (ubiquitous) | Hs.153998 | NM_020990.2 | +
204285_s_at | phorbol-12-myristate-13-acetate-induced protein 1 | Hs.96 | AI857639 | −
205225_at | estrogen receptor 1 | Hs.1657 | NM_000125.1 | −
214614_at | homeo box HB9 | Hs.37035 | AI738662 | +
202917_s_at | S100 calcium binding protein A8 (calgranulin A) | Hs.100000 | NM_002964.2 | +
219429_at | fatty acid hydroxylase | Hs.249163 | NM_024306.1 | +
208614_s_at | filamin B, beta (actin binding protein 278) | Hs.81008 | M62994.1 | −
204029_at | cadherin, EGF LAG seven-pass G-type receptor 2 (flamingo homolog, Drosophila) | Hs.57652 | NM_001408.1 | −
216401_x_at | Homo sapiens partial IGKV gene for Immunoglobulin kappa chain variable region, clone 38 | Hs.307136 | AJ408433 | +
203685_at | B-cell CLL/lymphoma 2 | Hs.79241 | NM_000633.1 | −
216576_x_at | Homo sapiens isolate donor N clone N88K Immunoglobulin kappa light chain variable region mRNA, partial cds | Hs.247910 | AF103529.1 | +
211138_s_at | kynurenine 3-monooxygenase (kynurenine 3-hydroxylase) | Hs.107318 | BC005297.1 | +
202039_at | TGFB1-induced anti-apoptotic factor 1 | Hs.78822 | NM_004740.1 | +
203627_at | insulin-like growth factor 1 receptor | Hs.239176 | NM_000875.2 | −
204863_s_at | interleukin 6 signal transducer (gp130, oncostatin M receptor) | Hs.82065 | BE856546 | −

TABLE 6a Predictor Sets for Molecular Subtype Using OVA SVM

Luminal A
Probe | Gene Description | UniGene | GenBank
201030_x_at | lactate dehydrogenase B | Hs.234489 | NM_002300.1
201525_at | apolipoprotein D | Hs.75736 | NM_001647.1
201688_s_at | tumor protein D52 | Hs.2384 | BE974098
201754_at | cytochrome c oxidase subunit VIc | Hs.351875 | NM_004374.1
202376_at | serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | Hs.234726 | NM_001085.2
202555_s_at | myosin, light polypeptide kinase | Hs.211582 | NM_005965.1
202746_at | Integral membrane protein 2A | Hs.17109 | AL021786
202991_at | steroidogenic acute regulatory protein related | Hs.77628 | NM_006804.1
203627_at | insulin-like growth factor 1 receptor | Hs.239176 | NM_000875.2
203749_s_at | retinoic acid receptor, alpha | Hs.250505 | AI806984
204198_s_at | runt-related transcription factor 3 | Hs.170019 | AA541630
204304_s_at | prominin-like 1 (mouse) | Hs.112360 | NM_006017.1
205225_at | estrogen receptor 1 | Hs.1657 | NM_000125.1
205471_s_at | dachshund homolog (Drosophila) | Hs.63931 | AW772082
206378_at | secretoglobin, family 2A, member 2 | Hs.46452 | NM_002411.1
208711_s_at | cyclin D1 (PRAD1: parathyroid adenomatosis 1) | Hs.82932 | BC000076.1
209016_s_at | keratin 7 | Hs.23881 | BC002700.1
209290_s_at | nuclear factor I/B | Hs.33287 | BC001283.1
209292_at | inhibitor of DNA binding 4, dominant negative helix-loop-helix protein | Hs.34853 | NM_001546.1
209351_at | keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner) | Hs.117729 | BC002690.1
209398_s_at | chitinase 3-like 1 (cartilage glycoprotein-39) | Hs.75184 | M80927.1
209465_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | AL565812
209863_s_at | tumor protein p63 | Hs.137569 | AF091627.1
211538_s_at | heat shock 70 kD protein 2 | Hs.75452 | U56725.1
211726_s_at | flavin containing monooxygenase 2 | Hs.132821 | BC005894.1
211737_x_at | pleiotrophin (heparin binding growth factor 8, neurite growth-promoting factor 1) | Hs.44 | BC005916.1
211958_at | Homo sapiens, clone IMAGE: 4183312, mRNA, partial cds | Hs.180324 | L27560.1
211959_at | Homo sapiens, clone IMAGE: 4183312, mRNA, partial cds | Hs.180324 | L27560.1
212730_at | KIAA0353 protein | Hs.10587 | AK026420.1
213564_x_at | lactate dehydrogenase B | Hs.234489 | BE042354
216836_s_at | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian) | Hs.323910 | X03363.1
217762_s_at | RAB31, member RAS oncogene family | Hs.223025 | BE789881
217838_s_at | RNB6 | Hs.241471 | NM_016337.1
218532_s_at | hypothetical protein FLJ20152 | Hs.82273 | NM_019000.1
221765_at | Homo sapiens mRNA full length insert cDNA clone EUROIMAGE 1287006 | Hs.23703 | BF970427

ER-Subtype II
Probe | Gene Description | UniGene | GenBank
200099_s_at | Human DNA sequence from clone RP11-486O22 on chromosome 10 Contains the 3' part of a gene for KIAA1128 protein, a novel pseudogene, a gene for protein similar to RPS3A (ribosomal protein S3A), ESTs, STSs, GSSs and CpG islands | Hs.307132 | AL356115
37892_at | collagen, type XI, alpha 1 | Hs.82772 | J04177
39248_at | aquaporin 3 | Hs.234642 | N74607
200606_at | desmoplakin (DPI, DPII) | Hs.349499 | NM_004415.1
200706_s_at | LPS-induced TNF-alpha factor | Hs.76507 | NM_004862.1
200749_at | RAN, member RAS oncogene family | Hs.10842 | BF112006
200811_at | cold inducible RNA binding protein | Hs.119475 | NM_001280.1
200823_x_at | ribosomal protein L29 | Hs.350068 | NM_000992.1
200853_at | H2A histone family, member Z | Hs.119192 | NM_002106.1
200925_at | cytochrome c oxidase subunit VIa polypeptide 1 | Hs.180714 | NM_004373.1
200935_at | calreticulin | Hs.16488 | NM_004343.2
201054_at | heterogeneous nuclear ribonucleoprotein A0 | Hs.77492 | BE966599
201080_at | phosphatidylinositol-4-phosphate 5-kinase, type II, beta | Hs.6335 | BF338509
201131_s_at | cadherin 1, type 1, E-cadherin (epithelial) | Hs.194657 | NM_004360.1
201134_x_at | cytochrome c oxidase subunit VIIc | Hs.3462 | NM_001867.1
201291_s_at | topoisomerase (DNA) II alpha (170 kD) | Hs.156346 | NM_001067.1
201349_at | solute carrier family 9 (sodium/hydrogen exchanger), isoform 3 regulatory factor 1 | Hs.184276 | NM_004252.1
201431_s_at | dihydropyrimidinase-like 3 | Hs.74566 | NM_001387.1
201552_at | lysosomal-associated membrane protein 1 | Hs.150101 | NM_005561.2
201688_s_at | tumor protein D52 | Hs.2384 | BE974098
201689_s_at | tumor protein D52 | Hs.2384 | BE974098
201830_s_at | neuroepithelial cell transforming gene 1 | Hs.25155 | NM_005863.1
201890_at | ribonucleotide reductase M2 polypeptide | Hs.75319 | NM_001034.1
201892_s_at | IMP (inosine monophosphate) dehydrogenase 2 | Hs.75432 | NM_000884.1
201903_at | ubiquinol-cytochrome c reductase core protein I | Hs.119251 | NM_003365.1
201925_s_at | decay accelerating factor for complement (CD55, Cromer blood group system) | Hs.1369 | NM_000574.1
201946_s_at | chaperonin containing TCP1, subunit 2 (beta) | Hs.6456 | AL545982
202071_at | syndecan 4 (amphiglycan, ryudocan) | Hs.252189 | NM_002999.1
202088_at | LIV-1 protein, estrogen regulated | Hs.79136 | AI635449
202291_s_at | matrix Gla protein | Hs.365706 | NM_000900.1
202376_at | serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | Hs.234726 | NM_001085.2
202489_s_at | FXYD domain-containing ion transport regulator 3 | Hs.301350 | BC005238.1
202704_at | transducer of ERBB2, 1 | Hs.178137 | AA675892
203202_at | HIV-1 rev binding protein 2 | Hs.154762 | AI950314
203627_at | insulin-like growth factor 1 receptor | Hs.239176 | NM_000875.2
203628_at | insulin-like growth factor 1 receptor | Hs.239176 | NM_000875.2
203789_s_at | sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3C | Hs.171921 | NM_006379.1
203892_at | WAP four-disulfide core domain 2 | Hs.2719 | NM_006103.1
203915_at | monokine induced by gamma interferon | Hs.77367 | NM_002416.1
203929_s_at | Homo sapiens cDNA FLJ31424 fis, clone NT2NE2000392 | Hs.101174 | NM_016835.1
203963_at | carbonic anhydrase XII | Hs.5338 | NM_001218.2
204018_x_at | hemoglobin, alpha 1 | Hs.272572 | NM_000558.2
204031_s_at | poly(rC) binding protein 2 | Hs.63525 | NM_005016.1
204320_at | collagen, type XI, alpha 1 | Hs.82772 | NM_001854.1
204457_s_at | growth arrest-specific 1 | Hs.65029 | NM_002048.1
205225_at | estrogen receptor 1 | Hs.1657 | NM_000125.1
205428_s_at | calbindin 2, (29 kD, calretinin) | Hs.106857 | NM_001740.2
205453_at | homeo box B2 | Hs.2733 | NM_002145.1
205887_x_at | mutS homolog 3 (E. coli) | Hs.42674 | NM_002439.1
205941_s_at | collagen, type X, alpha 1 (Schmid metaphyseal chondrodysplasia) | Hs.179729 | AI376003
206211_at | selectin E (endothelial adhesion molecule 1) | Hs.89546 | NM_000450.1
206916_x_at | tyrosine aminotransferase | Hs.161640 | NM_000353.1
207721_x_at | histidine triad nucleotide binding protein 1 | Hs.256697 | NM_005340.1
208702_x_at | amyloid beta (A4) precursor-like protein 2 | Hs.279518 | BC000373.1
208703_s_at | amyloid beta (A4) precursor-like protein 2 | Hs.279518 | BC000373.1
208711_s_at | cyclin D1 (PRAD1: parathyroid adenomatosis 1) | Hs.82932 | BC000076.1
208764_s_at | ATP synthase, H+ transporting, mitochondrial F0 complex, subunit c (subunit 9), isoform 2 | Hs.89399 | D13119.1
208791_at | clusterin (complement lysis inhibitor, SP-40, 40, sulfated glycoprotein 2, testosterone-repressed prostate message 2, apolipoprotein J) | Hs.75106 | M25915.1
208792_s_at | clusterin (complement lysis inhibitor, SP-40, 40, sulfated glycoprotein 2, testosterone-repressed prostate message 2, apolipoprotein J) | Hs.75106 | M25915.1
208826_x_at | histidine triad nucleotide binding protein 1 | Hs.256697 | U27143.1
208950_s_at | aldehyde dehydrogenase 7 family, member A1 | Hs.74294 | BC002515.1
209035_at | midkine (neurite growth-promoting factor 2) | Hs.82045 | M69148.1
209069_s_at | H3 histone, family 3B (H3.3B) | Hs.180877 | BC001124.1
209112_at | cyclin-dependent kinase inhibitor 1B (p27, Kip1) | Hs.238990 | BC001971.1
209116_x_at | hemoglobin, beta | Hs.155376 | M25079.1
209143_s_at | chloride channel, nucleotide-sensitive, 1A | Hs.84974 | AF005422.1
209351_at | keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner) | Hs.117729 | BC002690.1
209369_at | annexin A3 | Hs.1378 | M63310.1
209403_at | hypothetical protein DKFZp434P2235 | Hs.105891 | AL136860.1
209602_s_at | GATA binding protein 3 | Hs.169946 | AI796169
210163_at | small inducible cytokine subfamily B (Cys-X-Cys), member 11 | Hs.103982 | AF030514.1
210387_at | H2B histone family, member A | Hs.352109 | BC001131.1
210511_s_at | inhibin, beta A (activin A, activin AB alpha polypeptide) | Hs.727 | M13436.1
210715_s_at | serine protease inhibitor, Kunitz type, 2 | Hs.31439 | AF027205.1
210764_s_at | cysteine-rich, angiogenic inducer, 61 | Hs.8867 | AF003114.1
211113_s_at | ATP-binding cassette, sub-family G (WHITE), member 1 | Hs.10237 | U34919.1
211404_s_at | amyloid beta (A4) precursor-like protein 2 | Hs.279518 | BC004371.1
211696_x_at | hemoglobin, beta | Hs.155376 | AF349114.1
211745_x_at | hemoglobin, alpha 2 | Hs.347939 | BC005931.1
211935_at | ADP-ribosylation factor-like 6 interacting protein | Hs.75249 | D31885.1
212328_at | KIAA1102 protein | Hs.202949 | AK027231.1
212492_s_at | KIAA0876 protein | Hs.301011 | AW237172
212692_s_at | vesicle trafficking, beach and anchor containing | Hs.62354 | W60686
212942_s_at | KIAA1199 protein | Hs.50081 | AB033025.1
212956_at | KIAA0882 protein | Hs.90419 | AB020689.1
213557_at | KIAA0904 protein | Hs.278346 | AW305119
213764_s_at | Microfibril-associated glycoprotein-2 | Hs.300946 | AW665892
213765_at | Microfibril-associated glycoprotein-2 | Hs.300946 | AW665892
214079_at | Homo sapiens cDNA FLJ20338 fis, clone HEP12179 | Hs.152677 | AK000345.1
214414_x_at | hemoglobin, alpha 2 | Hs.347939 | T50399
214836_x_at | immunoglobulin kappa constant | Hs.156110 | BG536224
215224_at | Homo sapiens cDNA: FLJ21547 fis, clone COL06206 | Hs.322680 | AK025200.1
215867_x_at | adaptor-related protein complex 1, gamma 1 subunit | Hs.5344 | AL050025.1
217014_s_at | Homo sapiens PAC clone RP4-604G5 from 7q22-q31.1 | Hs.307354 | AC004522
217428_s_at | collagen, type X, alpha 1 (Schmid metaphyseal chondrodysplasia) | Hs.179729 | X98568
217704_x_at | ESTs, Moderately similar to ALU7_HUMAN ALU SUBFAMILY SQ SEQUENCE CONTAMINATION WARNING ENTRY [H. sapiens] | Hs.310806 | AI820796
217753_s_at | ribosomal protein S26 | Hs.299465 | NM_001029.1
218237_s_at | solute carrier family 38, member 1 | Hs.18272 | NM_030674.1
218302_at | uncharacterized hematopoietic stem/progenitor cells protein MDS033 | Hs.54960 | NM_018468.1
218388_at | 6-phosphogluconolactonase | Hs.100071 | NM_012088.1
218468_s_at | cysteine knot superfamily 1, BMP antagonist 1 | Hs.40098 | AF154054.1
218469_at | cysteine knot superfamily 1, BMP antagonist 1 | Hs.40098 | NM_013372.1
219087_at | asporin (LRR class 1) | Hs.10760 | NM_017680.1
219454_at | EGF-like-domain, multiple 6 | Hs.12844 | NM_015507.2
219734_at | hypothetical protein FLJ20174 | Hs.114556 | NM_017699.1
219773_at | NADPH oxidase 4 | Hs.93847 | NM_016931.1
220149_at | hypothetical protein FLJ22671 | Hs.193745 | NM_024861.1
220864_s_at | cell death-regulatory protein GRIM19 | Hs.279574 | NM_015965.1
221434_s_at | hypothetical protein DC50 | Hs.324521 | NM_031210.1
221473_x_at | tumor differentially expressed 1 | Hs.272168 | U49188.1
221541_at | hypothetical protein DKFZp434B044 | Hs.262958 | AL136861.1

Basal
Probe | Gene Description | UniGene | GenBank
202342_s_at | tripartite motif-containing 2 | Hs.12372 | NM_015271.1
202345_s_at | fatty acid binding protein 5 (psoriasis-associated) | Hs.153179 | NM_001444.1
202412_s_at | ubiquitin specific protease 1 | Hs.35086 | AW499935
203780_at | epithelial V-like antigen 1 | Hs.116851 | AF275945.1
204580_at | matrix metalloproteinase 12 (macrophage elastase) | Hs.1695 | NM_002426.1
205066_s_at | ectonucleotide pyrophosphatase/phosphodiesterase 1 | Hs.11951 | NM_006208.1
206042_x_at | SNRPN upstream reading frame | Hs.58606 | NM_022804.1
206102_at | KIAA0186 gene product | Hs.36232 | NM_021067.1
209205_s_at | LIM domain only 4 | Hs.3844 | BC003600.1
209212_s_at | Kruppel-like factor 5 (intestinal) | Hs.84728 | AB030824.1
209351_at | keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner) | Hs.117729 | BC002690.1
212236_x_at | keratin 17 | Hs.2785 | Z19574
212592_at | Homo sapiens, clone MGC: 24130 IMAGE: 4692359, mRNA, complete cds | Hs.76325 | AV733266
213664_at | solute carrier family 1 (neuronal/epithelial high affinity glutamate transporter, system Xag), member 1 | Hs.91139 | AW235061
213668_s_at | SRY (sex determining region Y)-box 4 | Hs.83484 | AI989477
213680_at | keratin 6B | Hs.335952 | AI831452
217744_s_at | p53-induced protein PIGPC1 | Hs.303125 | NM_022121.1
218499_at | Mst3 and SOK1-related kinase | Hs.23643 | NM_016542.1
218593_at | hypothetical protein FLJ10377 | Hs.274263 | NM_018077.1
222039_at | hypothetical protein FLJ11029 | Hs.274448 | AA292789

ERBB2
Probe | Gene Description | UniGene | GenBank
55616_at | hypothetical gene MGC9753 | Hs.91668 | AI703342
201388_at | proteasome (prosome, macropain) 26S subunit, non-ATPase, 3 | Hs.9736 | NM_002809.1
201525_at | apolipoprotein D | Hs.75736 | NM_001647.1
202035_s_at | secreted frizzled-related protein 1 | Hs.7306 | AI332407
202036_s_at | secreted frizzled-related protein 1 | Hs.7306 | AF017987.1
202145_at | lymphocyte antigen 6 complex, locus E | Hs.77667 | NM_002346.1
202218_s_at | fatty acid desaturase 2 | Hs.184641 | NM_004265.1
202376_at | serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | Hs.234726 | NM_001085.2
202991_at | steroidogenic acute regulatory protein related | Hs.77628 | NM_006804.1
203355_s_at | KIAA0942 protein | Hs.6763 | NM_015310.1
203404_at | armadillo repeat protein ALEX2 | Hs.48924 | NM_014782.1
203439_s_at | stanniocalcin 2 | Hs.155223 | BC000658.1
203628_at | insulin-like growth factor 1 receptor | Hs.239176 | NM_000875.2
203685_at | B-cell CLL/lymphoma 2 | Hs.79241 | NM_000633.1
204734_at | keratin 15 | Hs.80342 | NM_002275.1
204942_s_at | aldehyde dehydrogenase 3 family, member B2 | Hs.87539 | NM_000695.2
205225_at | estrogen receptor 1 | Hs.1657 | NM_000125.1
205306_x_at | kynurenine 3-monooxygenase (kynurenine 3-hydroxylase) | Hs.107318 | AI074145
206165_s_at | chloride channel, calcium activated, family member 2 | Hs.241551 | NM_006536.2
206378_at | secretoglobin, family 2A, member 2 | Hs.46452 | NM_002411.1
207076_s_at | argininosuccinate synthetase | Hs.160786 | NM_000050.1
207131_x_at | gamma-glutamyltransferase 1 | Hs.284380 | NM_013430.1
208180_s_at | H4 histone family, member H | Hs.93758 | NM_003543.2
208614_s_at | filamin B, beta (actin binding protein 278) | Hs.81008 | M62994.1
209016_s_at | keratin 7 | Hs.23881 | BC002700.1
209603_at | GATA binding protein 3 | Hs.169946 | AI796169
210163_at | small inducible cytokine subfamily B (Cys-X-Cys), member 11 | Hs.103982 | AF030514.1
210519_s_at | diaphorase (NADH/NADPH) (cytochrome b-5 reductase) | Hs.80706 | BC000906.1
210761_s_at | growth factor receptor-bound protein 7 | Hs.86859 | AB008790.1
211138_s_at | kynurenine 3-monooxygenase (kynurenine 3-hydroxylase) | Hs.107318 | BC005297.1
211430_s_at | immunoglobulin heavy constant gamma 3 (G3m marker) | Hs.300697 | M87789.1
211641_x_at | gb: L06101.1 /DEF = Human IG VH-region gene, complete cds. /FEA = mRNA /GEN = IGH@ /PROD = immunoglobulin heavy chain V-region /DB_XREF = gi: 185526 | | L06101.1
211645_x_at | gb: M85256.1 /DEF = Homo sapiens immunoglobulin kappa-chain VK-1 (IgK) mRNA, complete cds. /FEA = mRNA /GEN = IgK /PROD = immunoglobulin kappa-chain VK-1 /DB_XREF = gi: 186008 | | M85256.1
211657_at | gb: M18728.1 /DEF = Human nonspecific crossreacting antigen mRNA, complete cds. /FEA = mRNA /GEN = NCA; NCA; NCA /PROD = non-specific cross reacting antigen /DB_XREF = gi: 189084 | | M18728.1
212218_s_at | F-box only protein 9 | Hs.11050 | NM_012347.1
212281_s_at | hypothetical protein | Hs.199695 | L19183.1
214451_at | transcription factor AP-2 beta (activating enhancer binding protein 2 beta) | Hs.33102 | NM_003221.1
214669_x_at | Homo sapiens isolate donor N clone N168K immunoglobulin kappa light chain variable region mRNA, partial cds | Hs.306357 | BG485135
215176_x_at | immunoglobulin kappa constant | Hs.156110 | AW404894
216557_x_at | Homo sapiens mRNA for single-chain antibody, complete cds | Hs.249245 | U92706
216836_s_at | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian) | Hs.323910 | X03363.1
217157_x_at | Homo sapiens isolate donor N clone N8K immunoglobulin kappa light chain variable region mRNA, partial cds | Hs.247911 | AF103530.1
217388_s_at | kynureninase (L-kynurenine hydrolase) | Hs.169139 | D55639.1
217480_x_at | Human kappa-immunoglobulin germline pseudogene (cos118) variable region (subgroup V kappa I) | Hs.278448 | M20812
219768_at | hypothetical protein FLJ22418 | Hs.36583 | NM_024626.1
220038_at | serum/glucocorticoid regulated kinase-like | Hs.279696 | NM_013257.1

Normal/Normal-like
Probe | Gene Description | UniGene | GenBank
201030_x_at | lactate dehydrogenase B | Hs.234489 | NM_002300.1
201792_at | AE binding protein 1 | Hs.118397 | NM_001129.2
201860_s_at | plasminogen activator, tissue | Hs.274404 | NM_000930.1
202037_s_at | secreted frizzled-related protein 1 | Hs.7306 | NM_003012.2
202218_s_at | fatty acid desaturase 2 | Hs.184641 | NM_004265.1
202662_s_at | inositol 1,4,5-triphosphate receptor, type 2 | Hs.238272 | NM_002223.1
202746_at | integral membrane protein 2A | Hs.17109 | AL021786
202887_s_at | HIF-1 responsive RTP801 | Hs.111244 | NM_019058.1
203058_s_at | 3′-phosphoadenosine 5′-phosphosulfate synthase 2 | Hs.274230 | AW299958
203213_at | cell division cycle 2, G1 to S and G2 to M | Hs.334562 | AL524035
203325_s_at | collagen, type V, alpha 1 | Hs.146428 | AI130969
203685_at | B-cell CLL/lymphoma 2 | Hs.79241 | NM_000633.1
203706_s_at | frizzled homolog 7 (Drosophila) | Hs.173859 | NM_003507.1
203755_at | BUB1 budding uninhibited by benzimidazoles 1 homolog beta (yeast) | Hs.36708 | NM_001211.2
203789_s_at | sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3C | Hs.171921 | NM_006379.1
203878_s_at | matrix metalloproteinase 11 (stromelysin 3) | Hs.155324 | NM_005940.2
203915_at | monokine induced by gamma interferon | Hs.77367 | NM_002416.1
204033_at | thyroid hormone receptor interactor 13 | Hs.6566 | NM_004237.1
204602_at | dickkopf homolog 1 (Xenopus laevis) | Hs.40499 | NM_012242.1
204731_at | transforming growth factor, beta receptor III (betaglycan, 300 kD) | Hs.342874 | NM_003243.1
205034_at | cyclin E2 | Hs.30464 | NM_004702.1
205239_at | amphiregulin (schwannoma-derived growth factor) | Hs.270833 | NM_001657.1
207714_s_at | serine (or cysteine) proteinase inhibitor, clade H (heat shock protein 47), member 1, (collagen binding protein 1) | Hs.241579 | NM_004353.1
gb: NM_018407.1 /DEF = Homo sapiens putative integral membrane transporter (LC27), mRNA. 
/FEA = mRNA208029_s_at /GEN = LC27 /PROD = putative integral NM_018407.1 membranetransporter /DB_XREF = gi: 8923827 clusterin (complement lysisinhibitor, SP-40, 40, sulfated glycoprotein 2, testosterone-repressedprostate message 2, 208791_at apolipoprotein J) clusterin (complementlysis Hs.75106 M25915.1 inhibitor, SP-40, 40, sulfated glycoprotein 2,testosterone-repressed prostate message 2, 208792_s_at apolipoprotein J)Hs.75106 M25915.1 209071_s_at regulator of G-protein signalling 5Hs.24950 AF159570.1 209218_at squalene epoxidase Hs.71465 AF098865.1209291_at inhibitor of DNA binding 4, dominant negative Hs.34853NM_001546.1 helix-loop-helix protein 209292_at Inhibitor of DNA binding4, dominant negative Hs.34853 NM_001546.1 helix-loop-helix protein209465_x_at pleiotrophin (heparin binding growth factor 8, neurite Hs.44AL565812 growth-promoting factor 1) 209687_at stromal cell-derivedfactor 1 Hs.237356 U19495.1 210519_s_at diaphorase (NADHNADPH)(cytochrome b-5 reductase) Hs.80706 BC000906.1 gb: M18728.1 /DEF = Humannonspecific crossreacting antigen mRNA, complete cds. 
/FEA = mRNA /GEN =NCA; 211657_at NCA; NCA /PROD = non-specific cross reacting M18728.1antigen /DB_XREF = gi: 189084 211737_x_at pleiotrophin (heparin bindinggrowth factor 8, neurite Hs.44 BC005916.1 growth-promoting factor 1)212236_x_at keratin 17 Hs.2785 Z19574 212254_s_at bullous pemphigoidantigen 1 (230/240 kD) Hs.198689 BG253119 212592_at Homo sapiens, doneMGC: 24130 IMAGE: 4692359, mRNA, Hs.76325 AV733266 complete cds212730_at KIAA0353 protein Hs.10587 AK026420.1 214290_s_at H2A histonefamily, member O Hs.795 AA451996 216836_s_at v-erb-b2 erythroblasticleukemia viral oncogene Hs.323910 X03363.1 homolog 2, neuro/glioblastomaderived oncogene homolog (avian) 217428_s_at collagen, type X, alpha 1(Schmid metaphyseal Hs.179729 X98568 chondrodysplasia) 218087_s_atSH3-domain protein 5 (ponsin) Hs.108924 NM_015385.1 219115_s_atinterleukin 20 receptor, alpha Hs.21814 NM_014432.1 219197_s_at CEGP1protein Hs.222399 AI424243 219215_s_at solute carrier family 39 (zinctransporter), Hs.352415 NM_017767.1 member 4 219304_s_at spinalcord-derived growth factor-B Hs.112885 NM_025208.1 219768_athypothetical protein FLJ22418 Hs.36563 NM_024626.1 220038_atserum/glucocorticoid regulated kinase-like Hs.279696 NM_013257.1222155_s_at hypothetical protein FLJ11856 Hs.6459 AK021918.1

TABLE 6b: 2 Optimal Predictor Sets Using the GA/MLHD Algorithm

Probe | Gene | Unigene | GenBank

Gene set 1
200926_at | ribosomal protein S23 | Hs.3463 | NM_001025.1
205225_at | estrogen receptor 1 | Hs.1657 | NM_000125.1
200670_at | X-box binding protein 1 | Hs.149923 | NM_005080.1
208248_x_at | amyloid beta (A4) precursor-like protein 2 | Hs.279518 | NM_001642.1
209343_at | hypothetical protein FLJ13612 | Hs.24391 | BC002449.1
213399_x_at | ribophorin II | Hs.75722 | AI560720
214938_x_at | high-mobility group (nonhistone chromosomal) protein 1 | Hs.274472 | AF283771.2
207783_x_at | hypothetical protein FLJ20030 | Hs.326456 | NM_017627.1
204533_at | small inducible cytokine subfamily B (Cys-X-Cys), member 10 | Hs.2248 | NM_001565.1
204798_at | v-myb myeloblastosis viral oncogene homolog (avian) | Hs.1334 | NM_005375.1
212790_x_at | ribosomal protein L13a | Hs.119122 | BF942308
217276_x_at | serine hydrolase-like | Hs.301947 | AL590118.1
213975_s_at | tudor repeat associator with PCTAIRE 2 | Hs.283761 | AV711904
202428_x_at | diazepam binding inhibitor (GABA receptor modulator, acyl-Coenzyme A binding protein) | Hs.78888 | NM_020548.1
200925_at | cytochrome c oxidase subunit VIa polypeptide 1 | Hs.180714 | NM_004373.1

Gene set 2
221729_at | collagen, type V, alpha 2 | Hs.82985 | NM_000393.1
206461_x_at | metallothionein 1H | Hs.2667 | NM_005951.1
205509_at | carboxypeptidase B1 (tissue) | Hs.180884 | NM_001871.1
212320_at | tubulin, beta polypeptide | Hs.179661 | BC001002.1
209043_at | 3′-phosphoadenosine 5′-phosphosulfate synthase 1 | Hs.3833 | AF033026.1
200032_s_at | ribosomal protein L9 | Hs.157850 | NM_000661.1
202088_at | LIV-1 protein, estrogen regulated | Hs.79136 | AI635449
209604_s_at | GATA binding protein 3 | Hs.169946 | BC003070.1
201892_s_at | IMP (inosine monophosphate) dehydrogenase 2 | Hs.75432 | NM_000884.1
211896_s_at | decorin | Hs.76152 | AF138302.1
201952_at | activated leucocyte cell adhesion molecule | Hs.10247 | NM_001627.1
216836_s_at | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian) | Hs.323910 | X03363.1
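
As an illustration of how a predictor set such as those in Table 6b might be applied, the following minimal sketch reduces a whole-array expression profile to the probes of Gene set 2 before any classification step. It is not the GA/MLHD algorithm itself. The function name `restrict_to_predictor_set` and the numeric values are hypothetical, and the probe identifiers are read from the table on the assumption that split entries such as "206461_(—) ... x_at" denote a single probe ID.

```python
# Probe IDs of "Gene set 2" as read from Table 6b.
GENE_SET_2 = [
    "221729_at", "206461_x_at", "205509_at", "212320_at",
    "209043_at", "200032_s_at", "202088_at", "209604_s_at",
    "201892_s_at", "211896_s_at", "201952_at", "216836_s_at",
]


def restrict_to_predictor_set(profile, probe_ids):
    """Return the expression values for probe_ids, in that order.

    profile maps probe ID -> expression value (hypothetical layout).
    Raises KeyError if the profile lacks any probe of the set.
    """
    missing = [p for p in probe_ids if p not in profile]
    if missing:
        raise KeyError(f"profile lacks probes: {missing}")
    return [profile[p] for p in probe_ids]


# Made-up values purely for illustration; not measured data.
full_profile = {probe: float(i) for i, probe in enumerate(GENE_SET_2)}
full_profile["200926_at"] = 99.0  # a probe that is not in Gene set 2

features = restrict_to_predictor_set(full_profile, GENE_SET_2)
```

The resulting `features` list has one value per probe of the set and could then be passed to whichever discriminant is used.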

TABLE 7

Up Regulated in luminal D
Gene Name | Title | Unigene_Accession | Seq_Derived_From
201422_at | interferon, gamma-inducible protein 30 | Hs.14623 | NM_006332.1
201577_at | non-metastatic cells 1, protein (NM23A) expressed in | Hs.118638 | NM_000269.1
201884_at | carcinoembryonic antigen-related cell adhesion molecule 5 | Hs.220529 | NM_004363.1
201946_s_at | chaperonin containing TCP1, subunit 2 (beta) | Hs.6456 | AL545982
202433_at | UDP-galactose transporter related | Hs.154073 | NM_005827.1
202779_s_at | ubiquitin carrier protein | Hs.174070 | NM_014501.1
203628_at | insulin-like growth factor 1 receptor | Hs.239176 | NM_000875.2
204566_at | protein phosphatase 1D magnesium-dependent, delta isoform | Hs.100980 | NM_003620.1
204868_at | immature colon carcinoma transcript 1 | Hs.9078 | NM_001545.1
211762_s_at | karyopherin alpha 2 (RAG cohort 1, importin alpha 1) | Hs.159557 | BC005978.1
211958_at | Homo sapiens, clone IMAGE: 4183312, mRNA, partial cds | Hs.180324 | L27560.1
211959_at | Homo sapiens, clone IMAGE: 4183312, mRNA, partial cds | Hs.180324 | L27560.1
217755_at | hematological and neurological expressed 1 | Hs.109706 | NM_016185.1
218585_s_at | RA-regulated nuclear matrix-associated protein | Hs.126774 | NM_016448.1
218732_at | CGI-147 protein | Hs.12677 | NM_016077.1
219493_at | hypothetical protein FLJ22009 | Hs.123253 | NM_024745.1
222039_at | hypothetical protein FLJ11029 | Hs.274448 | AA292789
222231_s_at | hypothetical protein PRO1855 | Hs.283558 | AK025328.1

Down Regulated in luminal D
Gene Name | Title | Unigene_Accession [A] | Seq_Derived_From
201667_at | gap junction protein, alpha 1, 43kD (connexin 43) | Hs.74471 | NM_000165.2
201939_at | serum-inducible kinase | Hs.3838 | NM_006622.1
202291_s_at | matrix Gla protein | Hs.365706 | NM_000900.1
203143_s_at | KIAA0040 gene product | Hs.158282 | T79953
203892_at | WAP four-disulfide core domain 2 | Hs.2719 | NM_006103.1
203917_at | coxsackie virus and adenovirus receptor | Hs.79187 | NM_001338.1
204942_s_at | aldehyde dehydrogenase 3 family, member B2 | Hs.87539 | NM_000695.2
205381_at | 37 kDa leucine-rich repeat (LRR) protein | Hs.155545 | NM_005824.1
205590_at | RAS guanyl releasing protein 1 (calcium and DAG-regulated) | Hs.182591 | NM_005739.2
208798_x_at | golgin-67 | Hs.182982 | AF204231.1
209189_at | v-fos FBJ murine osteosarcoma viral oncogene homolog | Hs.25647 | BC004490.1
212708_at | Homo sapiens mRNA; cDNA DKFZp586B1922 (from clone DKFZp586B1922) | Hs.184779 | AV721987
212927_at | KIAA0594 protein | Hs.103283 | AB011166.1
213089_at | ESTs, Highly similar to T17212 hypothetical protein DKFZp434P211.1 [H. sapiens] | Hs.352339 | AU158490
213605_s_at | Homo sapiens mRNA; cDNA DKFZp564F112 (from clone DKFZp564F112) | Hs.166361 | AL049987.1
214020_x_at | integrin, beta 5 | Hs.149846 | AI335208
214053_at | Homo sapiens clone 23736 mRNA sequence | Hs.7888 | AW772192
214218_s_at | Homo sapiens cDNA FLJ30298 fis, clone BRACE2003172 | Hs.351546 | AV699347
214657_s_at | multiple endocrine neoplasia I | Hs.240443 | AU134977
214705_at | PDZ domain protein (Drosophila inaD-like) | Hs.321197 | AJ001306.1
215071_s_at | H2A histone family, member L | Hs.28777 | AL353759
215470_at | Human chromosome 5q13.1 clone 5G8 mRNA | Hs.14658 | U21915.1
217838_s_at | RNB6 | Hs.241471 | NM_016337.1
218312_s_at | hypothetical protein FLJ12895 | Hs.235390 | NM_023926.1
218330_s_at | retinoic acid inducible in neuroblastoma | Hs.23467 | NM_018162.1
218344_s_at | hypothetical protein FLJ10876 | Hs.94042 | NM_018254.1
218398_at | mitochondrial ribosomal protein S30 | Hs.28555 | NM_016640.1

1. A method of creating an expression profile characteristic of a breast tumor cell, said method comprising the steps of (a) isolating expression products from said breast tumor cell and a normal breast cell; (b) contacting said expression products for both the tumor and normal breast cell with a plurality of binding members capable of specifically binding to expression products of at least 10 genes selected from Table 2; so as to create an expression profile of those genes for both the tumor cell and the normal cell; (c) comparing the expression profile of the tumor cell and the normal cell; and (d) determining an expression profile characteristic of a breast tumor cell.
2-66. (canceled)
67. The method as set forth in claim 1 wherein the binding members are capable of specifically and independently binding to each of the genes provided in Table 2.
68. The method as set forth in claim 67 wherein the expression product is a polypeptide.
69. The method as set forth in claim 68 wherein the binding members are antibody binding domains.
70. The method as set forth in claim 67 wherein the expression product is mRNA or cDNA.
71. The method as set forth in claim 70 wherein the binding members are nucleic acid probes.
72. The method as set forth in claim 71 wherein the binding members are labelled.
73. The method as set forth in claim 70 wherein the expression products are labelled.
74. A method of creating an expression profile characteristic of a breast tumor cell, said method comprising the steps of (a) isolating expression products from a breast tumor cell, contacting said expression products with a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 2; so as to create a first expression profile of a tumor cell; (b) isolating expression products from a normal breast cell; contacting said expression products with the plurality of binding members as used in step (a), so as to create a comparable second expression profile of a normal breast cell; and (c) comparing the first and second expression profiles to determine an expression profile characteristic of a breast tumor cell.
75. The method as set forth in claim 74 wherein the binding members are capable of specifically and independently binding to each of the genes provided in Table 2.
76. The method as set forth in claim 75 wherein the expression product is a polypeptide.
77. The method as set forth in claim 76 wherein the binding members are antibody binding domains.
78. The method as set forth in claim 75 wherein the expression product is mRNA or cDNA.
79. The method as set forth in claim 78 wherein the binding members are nucleic acid probes.
80. The method as set forth in claim 79 wherein the binding members are labelled.
81. The method as set forth in claim 78 wherein the expression products are labelled.
82. A method of creating a nucleic acid expression profile characteristic of a breast tumor cell, said method comprising the steps of (a) isolating expression products from a first breast tumor cell, contacting said expression products with a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 2, so as to create a first expression profile; (b) repeating step (a) with expression products from at least a second breast tumor cell so as to create at least a second expression profile; (c) comparing the at least first and second expression profiles to create a standard nucleic acid expression profile characteristic of a breast tumor cell.
83. The method as set forth in claim 82 wherein the isolated expression products are contacted with a plurality of binding members capable of specifically and independently binding to expression products of each of the genes provided in Table 2.
84. The method as set forth in claim 83 wherein the expression product is a polypeptide.
85. The method as set forth in claim 84 wherein the binding members are antibody binding domains.
86. The method as set forth in claim 83 wherein the expression product is mRNA or cDNA.
87. The method as set forth in claim 86 wherein the binding members are nucleic acid probes.
88. The method as set forth in claim 87 wherein the binding members are labelled.
89. The method as set forth in claim 86 wherein the expression products are labelled.
90. A method for determining the presence or risk of breast cancer in an individual, said method comprising (a) obtaining expression products from a breast tissue cell obtained from an individual suspected of having or at risk from having breast cancer; (b) contacting said expression products with binding members capable of specifically and independently binding to expression products corresponding to at least 10 of the genes identified in Table 2; and (c) determining the presence or risk of breast cancer in said individual based on the binding of the expression products from said breast tissue cell to one or more of the binding members.
91. The method as set forth in claim 90 wherein the expression products are contacted with binding members capable of specifically and independently binding to expression products corresponding to each of the genes identified in Table 2.
92. The method as set forth in claim 91 wherein the determination of the presence or risk of breast cancer in said individual is carried out by comparing the binding of the expression products from the breast tissue cell under test with an expression profile characteristic of a breast tumor cell.
93. The method as set forth in claim 92 wherein the individual is of Asian descent.
94. A method of creating a nucleic acid expression profile characteristic of a breast tumor cell, said method comprising the steps of (a) isolating expression products from said breast tumor cell and a normal breast cell; (b) contacting said expression products for both the tumor and normal breast cell with a plurality of binding members capable of specifically binding to expression products of at least 10 genes selected from Table 4a; so as to create an expression profile of those genes for both the tumor cell and the normal cell; (c) comparing the expression profile of the tumor cell and the normal cell; and (d) determining a nucleic acid expression profile characteristic of a breast tumor cell.
95. The method as set forth in claim 94 wherein the isolated expression products are contacted with a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 4b.
96. The method as set forth in claim 95 wherein the expression product is mRNA or cDNA.
97. The method as set forth in claim 95 wherein the binding members are nucleic acid probes.
98. The method as set forth in claim 95 wherein the expression product is a polypeptide.
99. The method as set forth in claim 98 wherein the binding members are antibody binding domains.
100. The method as set forth in claim 99 wherein the binding members are labelled.
101. The method as set forth in claim 99 wherein the expression products are labelled.
102. A method of creating a nucleic acid expression profile characteristic of a breast tumor cell, said method comprising the steps of (a) isolating expression products from a breast tumor cell; contacting said expression products with a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 4a; so as to create a first expression profile of a tumor cell; (b) isolating expression products from a normal breast cell; contacting said expression products with the plurality of binding members as used in step (a); so as to create a comparable second expression profile of a normal breast cell; (c) comparing the first and second expression profiles to determine an expression profile characteristic of a breast tumor cell.
103. The method as set forth in claim 102 wherein the isolated expression products are contacted with a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 4b.
104. The method as set forth in claim 102 wherein the isolated expression products are contacted with a plurality of binding members capable of specifically and independently binding to expression products of at least twenty genes selected from Table 4a.
105. The method as set forth in claim 102 wherein the expression product is mRNA or cDNA.
106. The method as set forth in claim 102 wherein the binding members are nucleic acid probes.
107. The method as set forth in claim 102 wherein the expression product is a polypeptide.
108. The method as set forth in claim 107 wherein the binding members are antibody binding domains.
109. The method as set forth in claim 107 wherein the binding members are labelled.
110. The method as set forth in claim 107 wherein the expression products are labelled.
111. A method for determining the presence or risk of breast cancer in an individual, said method comprising (a) obtaining expression products from a breast tissue cell obtained from an individual suspected of having or at risk from having breast cancer; (b) contacting said expression products with binding members capable of binding to expression products corresponding to at least 10 genes identified in Table 4a; and (c) determining the presence or risk of breast cancer in said individual based on the binding of the expression products from said breast tissue cell to one or more of the binding members.
112. The method as set forth in claim 111 wherein the determination of the presence or risk of breast cancer is computed using an algorithm which distinguishes a tumor cell from a normal cell by their respective expression profiles.
113. The method as set forth in claim 111 wherein the determination of the presence or risk of breast cancer in said individual is carried out by comparing the binding of the expression products from the breast tissue cell under test with an expression profile characteristic of a breast tumor cell.
114. The method as set forth in claim 111 wherein the expression products are contacted with a plurality of binding members capable of binding to expression products of at least twenty genes selected from Table 4a.
115. The method as set forth in claim 114 wherein the determination of the presence or risk of breast cancer is computed using an algorithm which distinguishes a tumor cell from a normal cell by their respective expression profiles.
116. The method as set forth in claim 114 wherein the determination of the presence or risk of breast cancer in said individual is carried out by comparing the binding of the expression products from the breast tissue cell under test with an expression profile characteristic of a breast tumor cell.
117. The method as set forth in claim 111 wherein the expression products are contacted with a plurality of binding members capable of binding to expression products of at least 10 genes identified in Table 4b.
118. The method as set forth in claim 117 wherein the determination of the presence or risk of breast cancer is computed using an algorithm which distinguishes a tumor cell from a normal cell by their respective expression profiles.
119. The method as set forth in claim 117 wherein the determination of the presence or risk of breast cancer in said individual is carried out by comparing the binding of the expression products from the breast tissue cell under test with an expression profile characteristic of a breast tumor cell.
120. A method of obtaining a plurality of gene expression profiles in order to determine a standard expression profile characteristic of presence and/or type of breast cancer, said method comprising a) obtaining cells from a plurality of breast tumor samples; b) disrupting said cells to expose gene expression products; c) contacting said gene expression products with a plurality of binding members specific for expression products of at least 10 genes selected from Table 2; and d) determining a gene expression profile characteristic of the presence and/or type of breast cancer based on the binding of said expression products to said binding members for each of said plurality of breast tumor samples.
121. The method as set forth in claim 120 further comprising the step of producing a database containing a plurality of expression profiles obtained from said plurality of breast tumor samples.
122. The method as set forth in claim 120 further comprising the step of determining the statistical variation between the plurality of expression profiles.
123. A method of obtaining a plurality of gene expression profiles in order to determine a standard expression profile characteristic of presence and/or type of breast cancer, said method comprising a) obtaining cells from a plurality of breast tumor samples; b) disrupting said cells to expose gene expression products; c) contacting said gene expression products with a plurality of binding members specific for expression products of at least 10 genes selected from Table 4a; and d) determining a gene expression profile characteristic of the presence and/or type of breast cancer based on the binding of said expression products to said binding members for each of said plurality of breast tumor samples.
124. The method as set forth in claim 123 further comprising the step of producing a database containing a plurality of expression profiles obtained from said plurality of breast tumor samples.
125. The method as set forth in claim 123 further comprising the step of determining the statistical variation between the plurality of expression profiles.
126. A database comprising expression profiles characteristic of breast cancer or type of breast cancer produced by the method as set forth in claim 125.
127. The database as set forth in claim 126 wherein the expression profiles are nucleic acid expression profiles.
128. The database as set forth in claim 126 wherein the expression profiles are protein expression profiles.
129. A method of obtaining a plurality of gene expression profiles in order to determine a standard expression profile characteristic of presence and/or type of breast cancer, said method comprising a) obtaining cells from a plurality of breast tumor samples; b) disrupting said cells to expose gene expression products; c) contacting said gene expression products with a plurality of binding members specific for expression products of at least 10 genes selected from Table 4b; and d) determining a gene expression profile characteristic of the presence and/or type of breast cancer based on the binding of said expression products to said binding members for each of said plurality of breast tumor samples.
130. The method as set forth in claim 129 further comprising the step of producing a database containing a plurality of expression profiles obtained from said plurality of breast tumor samples.
131. The method as set forth in claim 129 further comprising the step of determining the statistical variation between the plurality of expression profiles.
132. A database comprising expression profiles characteristic of breast cancer or type of breast cancer produced by the method as set forth in claim 131.
133. The database as set forth in claim 132 wherein the expression profiles are nucleic acid expression profiles.
134. The database as set forth in claim 132 wherein the expression profiles are protein expression profiles.
135. A method of obtaining a plurality of gene expression profiles in order to determine a standard expression profile characteristic of presence and/or type of breast cancer, said method comprising a) obtaining cells from a plurality of breast tumor samples; b) disrupting said cells to expose gene expression products; c) contacting said gene expression products with a plurality of binding members specific for expression products of at least 10 genes selected from Table 5; and d) determining a gene expression profile characteristic of the presence and/or type of breast cancer based on the binding of said expression products to said binding members for each of said plurality of breast tumor samples.
136. The method as set forth in claim 135 further comprising the step of producing a database containing a plurality of expression profiles obtained from said plurality of breast tumor samples.
137. The method as set forth in claim 135 further comprising the step of determining the statistical variation between the plurality of expression profiles.
138. A database comprising expression profiles characteristic of breast cancer or type of breast cancer produced by the method as set forth in claim 137.
139. The database as set forth in claim 138 wherein the expression profiles are nucleic acid expression profiles.
140. The database as set forth in claim 138 wherein the expression profiles are protein expression profiles.
141. A method of obtaining a plurality of gene expression profiles in order to determine a standard expression profile characteristic of presence and/or type of breast cancer, said method comprising a) obtaining cells from a plurality of breast tumor samples; b) disrupting said cells to expose gene expression products; c) contacting said gene expression products with a plurality of binding members specific for expression products of at least 10 genes selected from Table 6a; and d) determining a gene expression profile characteristic of the presence and/or type of breast cancer based on the binding of said expression products to said binding members for each of said plurality of breast tumor samples.
142. The method as set forth in claim 141 further comprising the step of producing a database containing a plurality of expression profiles obtained from said plurality of breast tumor samples.
143. The method as set forth in claim 141 further comprising the step of determining the statistical variation between the plurality of expression profiles.
144. A database comprising expression profiles characteristic of breast cancer or type of breast cancer produced by the method as set forth in claim 143.
145. The database as set forth in claim 144 wherein the expression profiles are nucleic acid expression profiles.
146. The database as set forth in claim 144 wherein the expression profiles are protein expression profiles.
147. A method of obtaining a plurality of gene expression profiles in order to determine a standard expression profile characteristic of presence and/or type of breast cancer, said method comprising a) obtaining cells from a plurality of breast tumor samples; b) disrupting said cells to expose gene expression products; c) contacting said gene expression products with a plurality of binding members specific for expression products of at least 10 genes selected from Table 7; and d) determining a gene expression profile characteristic of the presence and/or type of breast cancer based on the binding of said expression products to said binding members for each of said plurality of breast tumor samples.
148. The method as set forth in claim 147 further comprising the step of producing a database containing a plurality of expression profiles obtained from said plurality of breast tumor samples.
149. The method as set forth in claim 147 further comprising the step of determining the statistical variation between the plurality of expression profiles.
150. A database comprising expression profiles characteristic of breast cancer or type of breast cancer produced by the method as set forth in claim 149.
151. The database as set forth in claim 150 wherein the expression profiles are nucleic acid expression profiles.
152. The database as set forth in claim 150 wherein the expression profiles are protein expression profiles.
 153. A methodof obtaining a plurality of gene expression profiles in order todetermine a standard expression profile characteristic of presenceand/or type of breast cancer, said method comprising a) obtaining cellsfrom a plurality of breast tumor sample; b) disrupting said cells toexpose gene expression products; c) contacting said gene expressionproducts with a plurality of binding members capable of specifically andindependently binding to expression products of the genes identified inTable 6b; d) determining a gene expression profile characteristic of thepresence and/or type of breast cancer based on the binding of saidexpression products to said binding members for each of said pluralityof breast tumor samples.
 154. The method as set forth in claim 153further comprising the step of producing a database containing aplurality of expression profiles obtained from said plurality of breasttumor samples.
 155. The method as set forth in claim 153 further comprising the step of determining the statistical variation between the plurality of expression profiles.
 156. A database comprising expression profiles characteristic of breast cancer or type of breast cancer produced by the method as set forth in claim 155.
 157. The database as set forth in claim 156 wherein the expression profiles are nucleic acid expression profiles.
 158. The database as set forth in claim 156 wherein the expression profiles are protein expression profiles.
 159. A method for classifying a breast tumor cell on the basis of Estrogen receptor (ER) status, said method comprising (a) obtaining expression products from a breast tumor cell; (b) contacting said expression products with binding members capable of binding to expression products corresponding to the genes identified in Table 5a; and (c) classifying the breast tumor on the basis of ER status based on the binding of the expression products from said breast tumor cell to one or more of the binding members.
 160. A method for classifying a breast tumor cell on the basis of ERBB2 status, said method comprising (a) obtaining expression products from a breast tumor cell; (b) contacting said expression products with binding members capable of binding to expression products corresponding to the genes identified in Table 5b; and (c) classifying the breast tumor on the basis of ERBB2 status based on the binding of the expression products from said breast tumor cell to one or more of the binding members.
 161. A method for classifying a breast tumor cell on the basis of its molecular subtype, said method comprising (a) obtaining expression products from a breast tumor cell; (b) contacting said expression products with binding members capable of binding to expression products corresponding to at least 10 genes identified in Table 6a; and (c) classifying the tumor cell with regard to its molecular subtype based on the binding profile of the expression products from the tumor cell and the binding members.
 162. The method as set forth in claim 161 wherein the binding members are capable of specifically and independently binding to at least twenty genes identified in Table 6a.
 163. The method as set forth in claim 162 wherein the molecular subtypes are selected from Luminal, ERBB2, Basal, ER-type II and normal/normal-like.
 164. The method as set forth in claim 161 wherein the binding members are capable of specifically and independently binding to at least the genes identified in Table 6b.
 165. The method as set forth in claim 164 wherein the molecular subtypes are selected from Luminal, ERBB2, Basal, ER-type II and normal/normal-like.
 166. A method for classifying a breast tumor cell on the basis of its Luminal sub-class, said method comprising (a) obtaining expression products from a breast tumor cell; (b) contacting said expression products with binding members capable of binding to expression products corresponding to at least 10 genes identified in Table 7; and (c) classifying the tumor cell with regard to its Luminal sub-class based on the binding profile of the expression products from the tumor cell and the binding members.
 167. The method as set forth in claim 166 wherein said tumor cell has been previously classified as a Luminal molecular subtype.
 168. The method as set forth in claim 167 wherein the Luminal sub-class is Luminal D or Luminal A.
 169. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 4a, said plurality of binding members being fixed to a solid support.
 170. The diagnostic tool as set forth in claim 169 wherein said binding members are cDNA or oligonucleotides.
 171. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 4b, said plurality of binding members being fixed to a solid support.
 172. The diagnostic tool as set forth in claim 171 wherein said binding members are cDNA or oligonucleotides.
 173. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 5a, said plurality of binding members being fixed to a solid support.
 174. The diagnostic tool as set forth in claim 173 wherein said binding members are cDNA or oligonucleotides.
 175. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 5b, said plurality of binding members being fixed to a solid support.
 176. The diagnostic tool as set forth in claim 175 wherein said binding members are cDNA or oligonucleotides.
 177. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 6a, said plurality of binding members being fixed to a solid support.
 178. The diagnostic tool as set forth in claim 177 wherein said binding members are cDNA or oligonucleotides.
 179. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of at least 10 genes selected from Table 7, said plurality of binding members being fixed to a solid support.
 180. The diagnostic tool as set forth in claim 179 wherein said binding members are cDNA or oligonucleotides.
 181. A diagnostic tool comprising a plurality of binding members capable of specifically and independently binding to expression products of the genes identified in Table 6b, said plurality of binding members being fixed to a solid support.
 182. The diagnostic tool as set forth in claim 181 wherein said binding members are cDNA or oligonucleotides.